Rice Production

Author

Rin Hwang

Introduction

For this project, I analyzed a rice paddies agricultural dataset that contains 2,790 instances and 45 features related to rice cultivation. Key variables include Hectares, Variety, Soil Types, Seed Rate, Manure (LP_nurseryarea), Fertilizer (DAP_20days), and Yield.

The main research question I am investigating is:

“To what extent do early-season irrigation (30DAI), peak-growth rainfall (71_105DRain), and nursery type (dry or wet) collectively predict rice yield across different soil types?”

Key variables I will be using are:

  • “Soil Types”: A categorical variable distinguishing the type of soil used which is either alluvial or clay. Alluvial soil is composed of fine-grained sediments that offers high fertility and efficient moisture retention for rice cultivation. In contrast, clay soil consists of extremely fine mineral particles that pack tightly together, creating a heavy, nutrient-rich texture that holds water effectively but can become difficult to manage when dry.

  • “Nursery”: A categorical variable that identifies the land preparation method used for the seedlings, classified as either a dry bed or a wet environment.

  • “30DAI(in mm)”: A quantitative variable for the total amount of controlled irrigation water applied to the fields during the first 30 days after initiation.

  • “71_105DRain(in mm)”: A quantitative variable that measures the volume of natural rainfall received during the mid-to-late growth phase, specifically between 71 and 105 days of the crop cycle.

  • “Paddy yield(in Kg)”: A quantitative variable that represents the total weight of harvested rice produced from each observed field plot. This variable serves as the primary outcome variable.

The source of this data comes from the University of California, Irvine Machine Learning Repository where it was donated by researchers, Dr. Muthukumaran Subramaniyan, Dr. John Peter K, Dr. Dilipkumar E., Dr. Savithri S., and Senbagam K., from their paper, “A Hybrid Machine Learning Model with Combined Wrapper Feature Selection Techniques to Improve the Yield of Paddy” that was published to the International Journal of Electronics and Communication Engineering in 2023.

To clean the raw data for statistical analysis, I used tidyverse library to convert character strings into explicit factored categories via as.factor(). To clean and standardize the variables/columns, I used the gsub() function to strip out all parenthetical units, and then utilized a second time to substitute spaces and hyphens with clean underscores before converting the final strings to lowercase.

Regarding data collection methodology, the repository does not provide a ReadMe file or documentation detailing the physical data collection process. However the repository does tell objectives and methods of the data, stating that driven by the critical need to address global food security for a rising population, this study utilizes machine learning to identify and predict the key environmental and agronomic factors that dictate paddy growth. In order to counter high computational costs, the researchers developed a Hybrid Machine Learning Model with Combined Wrapper Feature Selection (HMLCWFS) that integrates feature selection techniques, including Backward Elimination and Gradient Boosting, and merges the results using Poincaré’s formula.

I chose this topic because I wanted to see how agricultural data can help address the critical global challenge of food insecurity. I am also always interested in choosing completely new unfamiliar subjects for my projects whenever I can, so I can keep learning new things and and challenge my analytic skills.

Uploading Libraries and Dataset

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.3
Warning: package 'readr' was built under R version 4.5.3
Warning: package 'dplyr' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
setwd("C:/Users/hwang/OneDrive/Documents/MC stuff/Spring 2026/DATA 110 Data Visualization and Communication/Projects/Final Project")
ricedata <- read_csv("paddydataset.csv")
Rows: 2789 Columns: 45
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): Agriblock, Variety, Soil Types, Nursery, Wind Direction_D1_D30, Wi...
dbl (37): Hectares, Seedrate(in Kg), LP_Mainfield(in Tonnes), Nursery area (...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Retrieving List of Variables

names(ricedata)
 [1] "Hectares"                           "Agriblock"                         
 [3] "Variety"                            "Soil Types"                        
 [5] "Seedrate(in Kg)"                    "LP_Mainfield(in Tonnes)"           
 [7] "Nursery"                            "Nursery area (Cents)"              
 [9] "LP_nurseryarea(in Tonnes)"          "DAP_20days"                        
[11] "Weed28D_thiobencarb"                "Urea_40Days"                       
[13] "Potassh_50Days"                     "Micronutrients_70Days"             
[15] "Pest_60Day(in ml)"                  "30DRain( in mm)"                   
[17] "30DAI(in mm)"                       "30_50DRain( in mm)"                
[19] "30_50DAI(in mm)"                    "51_70DRain(in mm)"                 
[21] "51_70AI(in mm)"                     "71_105DRain(in mm)"                
[23] "71_105DAI(in mm)"                   "Min temp_D1_D30"                   
[25] "Max temp_D1_D30"                    "Min temp_D31_D60"                  
[27] "Max temp_D31_D60"                   "Min temp_D61_D90"                  
[29] "Max temp_D61_D90"                   "Min temp_D91_D120"                 
[31] "Max temp_D91_D120"                  "Inst Wind Speed_D1_D30(in Knots)"  
[33] "Inst Wind Speed_D31_D60(in Knots)"  "Inst Wind Speed_D61_D90(in Knots)" 
[35] "Inst Wind Speed_D91_D120(in Knots)" "Wind Direction_D1_D30"             
[37] "Wind Direction_D31_D60"             "Wind Direction_D61_D90"            
[39] "Wind Direction_D91_D120"            "Relative Humidity_D1_D30"          
[41] "Relative Humidity_D31_D60"          "Relative Humidity_D61_D90"         
[43] "Relative Humidity_D91_D120"         "Trash(in bundles)"                 
[45] "Paddy yield(in Kg)"                

Cleaning the Variables

# Remove parentheses and everything inside them
names(ricedata) <- gsub("\\s*\\(.*?\\)", "", names(ricedata))

# Replace spaces with underscores
names(ricedata) <- gsub("[ /-]", "_", names(ricedata))

# Make all variable characters lowercase
names(ricedata) <- tolower(names(ricedata))

names(ricedata)
 [1] "hectares"                   "agriblock"                 
 [3] "variety"                    "soil_types"                
 [5] "seedrate"                   "lp_mainfield"              
 [7] "nursery"                    "nursery_area"              
 [9] "lp_nurseryarea"             "dap_20days"                
[11] "weed28d_thiobencarb"        "urea_40days"               
[13] "potassh_50days"             "micronutrients_70days"     
[15] "pest_60day"                 "30drain"                   
[17] "30dai"                      "30_50drain"                
[19] "30_50dai"                   "51_70drain"                
[21] "51_70ai"                    "71_105drain"               
[23] "71_105dai"                  "min_temp_d1_d30"           
[25] "max_temp_d1_d30"            "min_temp_d31_d60"          
[27] "max_temp_d31_d60"           "min_temp_d61_d90"          
[29] "max_temp_d61_d90"           "min_temp_d91_d120"         
[31] "max_temp_d91_d120"          "inst_wind_speed_d1_d30"    
[33] "inst_wind_speed_d31_d60"    "inst_wind_speed_d61_d90"   
[35] "inst_wind_speed_d91_d120"   "wind_direction_d1_d30"     
[37] "wind_direction_d31_d60"     "wind_direction_d61_d90"    
[39] "wind_direction_d91_d120"    "relative_humidity_d1_d30"  
[41] "relative_humidity_d31_d60"  "relative_humidity_d61_d90" 
[43] "relative_humidity_d91_d120" "trash"                     
[45] "paddy_yield"               

Cleaning the Dataset

ricedata_clean <- ricedata %>%
  filter(paddy_yield > 0, hectares > 0) %>%
  mutate(
    nursery = as.factor(nursery),
    soil_types = as.factor(soil_types), 
    `71_105drain` = as.numeric(`71_105drain`))%>%
    select(paddy_yield, '30dai', '71_105drain', nursery, soil_types, hectares) %>%
  arrange(desc(paddy_yield))

colSums(is.na(ricedata_clean))
paddy_yield       30dai 71_105drain     nursery  soil_types    hectares 
          0           0           0           0           0           0 

Preliminary Visualization

Yield by Rainfall and Soil Type

ggplot(ricedata_clean, aes(x = as.factor(`71_105drain`), y = paddy_yield, fill = soil_types)) +
  geom_boxplot(alpha = 0.7, outlier.size = 1) +
  facet_wrap(~nursery) +
  scale_fill_manual(values = c("alluvial" = "#4E79A7", "clay" = "#F28E2B")) +
  theme_classic(base_family = "serif") +
  labs(
    title = "Paddy Yield by Rainfall and Soil Type",
    subtitle = "Yield distributions across recorded rainfall levels",
    x = "Rainfall (71-105 Days) in mm",
    y = "Total Yield",
    fill = "Soil Category"
  ) +
  theme(
    legend.position = "bottom",
    strip.background = element_blank(),
    strip.text = element_text(face = "bold", size = 12),
    axis.text.x = element_text(angle = 0) 
  )

This visualization suggests that while alluvial soil provides more reliable and consistent yields in Wet Nursery environments, clay soil is potentially superior in dry nursery environments at lower peak-rainfall levels. The high degree of overlap between the boxes indicates that while soil and nursery choices matter, there is significant variance in yield that the regression model will need to account for using the other variables (like 30DAI).

Multiple Linear Regression Model

paddy_model <- lm(paddy_yield ~ `30dai` + `71_105drain` + nursery + soil_types, data = ricedata_clean)

model_summary <- summary(paddy_model)
print(model_summary)

Call:
lm(formula = paddy_yield ~ `30dai` + `71_105drain` + nursery + 
    soil_types, data = ricedata_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-17402  -6068   2062   8456  16242 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -46133.423  69369.768  -0.665    0.506
`30dai`           640.520    628.441   1.019    0.308
`71_105drain`     909.733    944.608   0.963    0.336
nurserywet        155.559    350.740   0.444    0.657
soil_typesclay     -3.641    350.306  -0.010    0.992

Residual standard error: 9204 on 2784 degrees of freedom
Multiple R-squared:  0.0004432, Adjusted R-squared:  -0.0009929 
F-statistic: 0.3086 on 4 and 2784 DF,  p-value: 0.8724

Equation

Adjusted R-squared (-0.0009929): This value is very low and it indicates that the model explains 0% of the variability in paddy yield. Thus it suggests that early irrigation, late rainfall, soil matrix types, and basic nursery configurations are completely insufficient for predicting the size of a harvest on their own.

F-statistic P-value (0.8724): Because the p-value is well over the standard threshold of 0.05, the model is not statistically significant and fails to predict paddy yield any better than a random guess.

par(mfrow = c(2, 2))
plot(paddy_model, col = "#4E79A7")

Yield Success Across Soil Type and Rainfall

hc_data <- ricedata_clean
hchart(
  hc_data, 
  "bubble", 
  hcaes(x = `30dai`, y = paddy_yield, size = `71_105drain`, group = soil_types),
  maxSize = "5%", 
  alpha = 0.6
) %>%
  hc_colors(c("#5DADE2", "#EB984E")) %>% 
  hc_plotOptions(
    bubble = list(
      lineWidth = 0,
      marker = list(fillOpacity = 0.6)
    )
  ) %>%
  hc_title(text = "Paddy Yield vs. Early Irrigation") %>%
  hc_subtitle(text = "Distribution of yield across soil types and rainfall") %>%
  hc_xAxis(title = list(text = "Irrigation (30DAI) mm")) %>%
  hc_yAxis(
    title = list(text = "Total Yield (Kg)"),
    gridLineWidth = 0 # Removes horizontal grid lines for a minimalist look
  ) %>%
  hc_legend(enabled = TRUE) %>%
  hc_tooltip(pointFormat = "Yield: {point.y} kg<br>Rain: {point.size} mm") %>%
  hc_add_theme(hc_theme_smpl()) %>% 
  hc_credits(enabled = TRUE, text = "Source: UCI Machine Learning Repository")

This interactive bubble chart illustrates the relationship between early-season irrigation (30DAI) and Total Paddy Yield, while simultaneously showing peak-growth rainfall through bubble size and soil category through color. The visualization reveals that the irrigation data is highly clustered at three specific intervals: approximately 20.3mm, 21.5mm, and 21.9mm. This created vertical columns of data rather than a continuous trend. Despite these distinct levels of irrigation, the yields remain broadly distributed between 5,000kg and 40,000kg across all groups, suggesting that irrigation volume within this narrow range is not a primary determinant of harvest success. Furthermore, the heavy overlap between the orange (clay) and blue (alluvial) bubbles indicates that soil type does not create a significant difference in yield performance under these specific water conditions. Ultimately, the chart visually supports the findings of the multiple linear regression, demonstrating that these environmental variables possess low predictive power for paddy productivity in this dataset.

Distribution of Paddy Yield by Seed Variety and Soil Types

The variable “variety” refers to the type of rice/seed used. In this dataset, there are 3 that are used.

  • Delux Ponni: This is a high-quality, premium variant of traditional Ponni rice that is known for its superior grain texture and high market value. However, its crop yield is highly sensitive to soil conditions compared to other varieties.

  • CO_43: This is a chemically mutated rice variety developed by Tamil Nadu Agricultural University that thrives in harsh, alkaline, or clay-heavy soils where traditional varieties like Delux Ponni struggle to perform.

  • Ponmani: This is a high-yielding, long-duration cultivar known for its bold grains and excellent tolerance to waterlogging and flooding, making it structurally sturdier in heavy rainfall conditions than both CO_43 and Delux Ponni.

            <script type='text/javascript'>                    var divElement = document.getElementById('viz1778889161272');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.minWidth='420px';vizElement.style.maxWidth='650px';vizElement.style.width='100%';vizElement.style.minHeight='587px';vizElement.style.maxHeight='887px';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

This Tableau treemap illustrates average paddy yield using a red-to-blue color gradient, distributed by seed variety and nested within alluvial and clay soil types. I noticed a striking pattern with the delux ponni variety as it is the top performer in alluvial soil, achieving the dataset’s maximum average yield of 22,956 kg (dark blue), yet it drops drastically into the lower-performing red zone when grown in clay soil. Conversely, clay soil yields remain stable and bounded across all varieties, even allowing the CO_43 variety to thrive at 22,730 kg despite underperforming in alluvial environments.

A primary limitation of this visualization is its inability to cleanly layer in continuous environmental data, such as early-season irrigation (30dai) or peak rainfall (71_105drain), without making the treemap messy and would be unreadable. Additionally, while rectangle sizes reflect record volume, the chart lacks explicit labels for exact plot counts (n), leaving it unknown in the chart whether a few outlier farms skewed the averages. I wished to have included a map chart with these varieties with the agriblock locations as it would have provided a clearer spatial understanding of which regional farm blocks naturally favor specific seed types.

References

Subramaniyan, M. (2023). Paddy Dataset [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C55W3J.

Tamil Nadu Agricultural University. (2024). Tamil Nadu Varieties. Tamil Nadu Agricultural University. http://www.agritech.tnau.ac.in/expert_system/paddy/TNvarieties.html

People harvesting rice photo: https://www.bbc.com/news/business-66323991