Abstract

This project applies dimension reduction to a crop dataset, the Paddy (Rice) Dataset from the UC Irvine Machine Learning Repository, which contains multiple agronomic, environmental, and crop‑related features. Modern agricultural research increasingly relies on large, feature‑rich datasets to understand crop performance, optimize cultivation practices, and support data‑driven decision‑making. As farming conditions, climate patterns, and crop varieties evolve, the volume and complexity of agricultural data continue to grow.

For this project, we use the full Paddy Dataset because all feature groups (soil characteristics, climate variables, crop variety or traits, and management practices) contribute to understanding paddy growth patterns. Small variations in a few features can significantly influence yield, which makes dimension reduction a valuable tool for uncovering underlying structure in the dataset. The results can then be used to identify the parameters that most influence paddy production volume and to inform real‑world cultivation recommendations.

Keywords: Paddy Dataset, Agriculture, Crop Features, Dimensionality Reduction, PCA, MCA, UMAP, Isomap

Loading Necessary R Libraries or Packages

library(tidyverse) # attaches ggplot2, dplyr, readr, tidyr, and the other core tidyverse packages
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
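
The chunk above assumes the tidyverse is already installed. As a minimal sketch, the check below uses only base R to report packages that are missing before library() is called. The names `needed`, `installed`, and `missing_pkgs` are illustrative, and the vector lists only the package this report loads.

```r
# Packages this report loads; extend the vector if later sections need more.
needed <- c("tidyverse")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "),
          ". Install with install.packages() before calling library().")
} else {
  message("All required packages are available.")
}
```

The check only reports; it does not install anything, so it is safe to run in a restricted environment.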

Dataset description

Data Set Source - UC Irvine Machine Learning Repository

Data Set Name - Paddy Dataset

Link to Data Set

Details

  • The data set was uploaded on 14 July 2025, which indicates how recent it is.
  • In total, the data set has 45 features (columns) describing the cultivation results or produce of closely coupled farms in a particular region.
  • The data set includes agronomic, environmental, and crop‑specific variables such as soil type, rainfall, temperature, plant characteristics, and recommended paddy varieties.
  • Many features are redundant or derived from others, which makes the data set well suited to dimension‑reduction techniques such as PCA, UMAP, and Isomap.
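
To illustrate why derived features make a data set a good candidate for dimension reduction, here is a minimal sketch on synthetic data. The values and column names are invented for illustration and are not taken from the Paddy Dataset. Two columns are exact multiples of a third, so PCA (base R's prcomp) needs only two components to describe all four columns.

```r
# Synthetic toy data (hypothetical values, NOT from the Paddy Dataset).
set.seed(42)
hectares <- sample(1:6, 200, replace = TRUE)
toy <- data.frame(
  hectares = hectares,
  seed_kg  = 25 * hectares,        # exact multiple of hectares
  dap_kg   = 40 * hectares,        # exact multiple of hectares
  humidity = runif(200, 65, 90)    # generated independently of hectares
)

# Pairwise correlations: derived columns correlate perfectly with their source.
cor_mat <- cor(toy)

# PCA on centered, unit-variance columns.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(cor_mat, 2))
print(round(var_explained, 3))
```

Because three of the four columns carry the same information, the last two components have essentially zero variance. On real data the redundancy is rarely exact, so the trailing components are small rather than zero.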

Data Import and Cleanup

Importing Dataset

Import the dataset, then inspect its structure and value ranges with the str and summary functions. The first five rows are printed for reference in the Data Cleanup step below.

df <- read_csv("./paddydataset.csv")
## Rows: 2789 Columns: 45
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Agriblock, Variety, Soil Types, Nursery, Wind Direction_D1_D30, Wi...
## dbl (37): Hectares, Seedrate(in Kg), LP_Mainfield(in Tonnes), Nursery area (...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df)
## spc_tbl_ [2,789 × 45] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Hectares                          : num [1:2789] 6 6 6 6 6 6 6 6 6 6 ...
##  $ Agriblock                         : chr [1:2789] "Cuddalore" "Kurinjipadi" "Panruti" "Kallakurichi" ...
##  $ Variety                           : chr [1:2789] "CO_43" "ponmani" "delux ponni" "CO_43" ...
##  $ Soil Types                        : chr [1:2789] "alluvial" "clay" "alluvial" "clay" ...
##  $ Seedrate(in Kg)                   : num [1:2789] 150 150 150 150 150 150 150 150 150 150 ...
##  $ LP_Mainfield(in Tonnes)           : num [1:2789] 75 75 75 75 75 75 75 75 75 75 ...
##  $ Nursery                           : chr [1:2789] "dry" "wet" "dry" "wet" ...
##  $ Nursery area (Cents)              : num [1:2789] 120 120 120 120 120 120 120 120 120 120 ...
##  $ LP_nurseryarea(in Tonnes)         : num [1:2789] 6 6 6 6 6 6 6 6 6 6 ...
##  $ DAP_20days                        : num [1:2789] 240 240 240 240 240 240 240 240 240 240 ...
##  $ Weed28D_thiobencarb               : num [1:2789] 12 12 12 12 12 12 12 12 12 12 ...
##  $ Urea_40Days                       : num [1:2789] 163 163 163 163 163 ...
##  $ Potassh_50Days                    : num [1:2789] 62.3 62.3 62.3 62.3 62.3 ...
##  $ Micronutrients_70Days             : num [1:2789] 90 90 90 90 90 90 90 90 90 90 ...
##  $ Pest_60Day(in ml)                 : num [1:2789] 3600 3600 3600 3600 3600 3600 3600 3600 3600 3600 ...
##  $ 30DRain( in mm)                   : num [1:2789] 19.6 19.6 18.5 18.5 18.1 18.1 19.6 19.6 18.5 18.5 ...
##  $ 30DAI(in mm)                      : num [1:2789] 20.4 20.4 21.5 21.5 21.9 21.9 20.4 20.4 21.5 21.5 ...
##  $ 30_50DRain( in mm)                : num [1:2789] 187 187 185 185 186 ...
##  $ 30_50DAI(in mm)                   : num [1:2789] 271 271 273 273 272 ...
##  $ 51_70DRain(in mm)                 : num [1:2789] 167 167 165 165 166 ...
##  $ 51_70AI(in mm)                    : num [1:2789] 250 250 252 252 251 ...
##  $ 71_105DRain(in mm)                : num [1:2789] 61 61 60 60 60.2 60.2 61 61 60 60 ...
##  $ 71_105DAI(in mm)                  : num [1:2789] 64 64 65 65 64.8 64.8 64 64 65 65 ...
##  $ Min temp_D1_D30                   : num [1:2789] 18.5 19.5 20 19 20.5 18 18.5 19.5 20 19 ...
##  $ Max temp_D1_D30                   : num [1:2789] 34 34 35 33 32 31 34 34 35 33 ...
##  $ Min temp_D31_D60                  : num [1:2789] 16 18.5 18 17 17.5 15.5 16 18.5 18 17 ...
##  $ Max temp_D31_D60                  : num [1:2789] 30 35 30 32 28 34 30 35 30 32 ...
##  $ Min temp_D61_D90                  : num [1:2789] 15.5 17 17.5 16.5 18 15 15.5 17 17.5 16.5 ...
##  $ Max temp_D61_D90                  : num [1:2789] 31 32.5 33.5 31.5 34 33 31 32.5 33.5 31.5 ...
##  $ Min temp_D91_D120                 : num [1:2789] 16 16 18 15.5 16.5 15 16 16 18 15.5 ...
##  $ Max temp_D91_D120                 : num [1:2789] 33 30.5 33 32.5 35 31.5 33 30.5 33 32.5 ...
##  $ Inst Wind Speed_D1_D30(in Knots)  : num [1:2789] 4 10 4 8 10 6 4 10 4 8 ...
##  $ Inst Wind Speed_D31_D60(in Knots) : num [1:2789] 10 4 12 6 12 6 10 4 12 6 ...
##  $ Inst Wind Speed_D61_D90(in Knots) : num [1:2789] 8 10 4 8 10 8 8 10 4 8 ...
##  $ Inst Wind Speed_D91_D120(in Knots): num [1:2789] 10 6 12 6 12 10 10 6 12 6 ...
##  $ Wind Direction_D1_D30             : chr [1:2789] "SW" "NW" "ENE" "W" ...
##  $ Wind Direction_D31_D60            : chr [1:2789] "W" "S" "NE" "WNW" ...
##  $ Wind Direction_D61_D90            : chr [1:2789] "NNW" "SE" "NNE" "SE" ...
##  $ Wind Direction_D91_D120           : chr [1:2789] "WSW" "SSE" "W" "S" ...
##  $ Relative Humidity_D1_D30          : num [1:2789] 72 64.6 85 88.5 72.7 78.6 72 64.6 85 88.5 ...
##  $ Relative Humidity_D31_D60         : num [1:2789] 78 85 96 95 91 80 78 85 96 95 ...
##  $ Relative Humidity_D61_D90         : num [1:2789] 88 84 84 81 83 92 88 84 84 81 ...
##  $ Relative Humidity_D91_D120        : num [1:2789] 85 87 79 84 81 88 85 87 79 84 ...
##  $ Trash(in bundles)                 : num [1:2789] 540 600 600 540 600 480 540 480 600 540 ...
##  $ Paddy yield(in Kg)                : num [1:2789] 35028 35412 36300 35016 34044 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Hectares = col_double(),
##   ..   Agriblock = col_character(),
##   ..   Variety = col_character(),
##   ..   `Soil Types` = col_character(),
##   ..   `Seedrate(in Kg)` = col_double(),
##   ..   `LP_Mainfield(in Tonnes)` = col_double(),
##   ..   Nursery = col_character(),
##   ..   `Nursery area (Cents)` = col_double(),
##   ..   `LP_nurseryarea(in Tonnes)` = col_double(),
##   ..   DAP_20days = col_double(),
##   ..   Weed28D_thiobencarb = col_double(),
##   ..   Urea_40Days = col_double(),
##   ..   Potassh_50Days = col_double(),
##   ..   Micronutrients_70Days = col_double(),
##   ..   `Pest_60Day(in ml)` = col_double(),
##   ..   `30DRain( in mm)` = col_double(),
##   ..   `30DAI(in mm)` = col_double(),
##   ..   `30_50DRain( in mm)` = col_double(),
##   ..   `30_50DAI(in mm)` = col_double(),
##   ..   `51_70DRain(in mm)` = col_double(),
##   ..   `51_70AI(in mm)` = col_double(),
##   ..   `71_105DRain(in mm)` = col_double(),
##   ..   `71_105DAI(in mm)` = col_double(),
##   ..   `Min temp_D1_D30` = col_double(),
##   ..   `Max temp_D1_D30` = col_double(),
##   ..   `Min temp_D31_D60` = col_double(),
##   ..   `Max temp_D31_D60` = col_double(),
##   ..   `Min temp_D61_D90` = col_double(),
##   ..   `Max temp_D61_D90` = col_double(),
##   ..   `Min temp_D91_D120` = col_double(),
##   ..   `Max temp_D91_D120` = col_double(),
##   ..   `Inst Wind Speed_D1_D30(in Knots)` = col_double(),
##   ..   `Inst Wind Speed_D31_D60(in Knots)` = col_double(),
##   ..   `Inst Wind Speed_D61_D90(in Knots)` = col_double(),
##   ..   `Inst Wind Speed_D91_D120(in Knots)` = col_double(),
##   ..   `Wind Direction_D1_D30` = col_character(),
##   ..   `Wind Direction_D31_D60` = col_character(),
##   ..   `Wind Direction_D61_D90` = col_character(),
##   ..   `Wind Direction_D91_D120` = col_character(),
##   ..   `Relative Humidity_D1_D30` = col_double(),
##   ..   `Relative Humidity_D31_D60` = col_double(),
##   ..   `Relative Humidity_D61_D90` = col_double(),
##   ..   `Relative Humidity_D91_D120` = col_double(),
##   ..   `Trash(in bundles)` = col_double(),
##   ..   `Paddy yield(in Kg)` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(df)
##     Hectares      Agriblock           Variety           Soil Types       
##  Min.   :1.000   Length:2789        Length:2789        Length:2789       
##  1st Qu.:3.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.717                                                           
##  3rd Qu.:5.000                                                           
##  Max.   :6.000                                                           
##  Seedrate(in Kg)  LP_Mainfield(in Tonnes)   Nursery         
##  Min.   : 25.00   Min.   :12.50           Length:2789       
##  1st Qu.: 75.00   1st Qu.:37.50           Class :character  
##  Median :100.00   Median :50.00           Mode  :character  
##  Mean   : 92.94   Mean   :46.47                             
##  3rd Qu.:125.00   3rd Qu.:62.50                             
##  Max.   :150.00   Max.   :75.00                             
##  Nursery area (Cents) LP_nurseryarea(in Tonnes)   DAP_20days   
##  Min.   : 20.00       Min.   :1.000             Min.   : 40.0  
##  1st Qu.: 60.00       1st Qu.:3.000             1st Qu.:120.0  
##  Median : 80.00       Median :4.000             Median :160.0  
##  Mean   : 74.35       Mean   :3.717             Mean   :148.7  
##  3rd Qu.:100.00       3rd Qu.:5.000             3rd Qu.:200.0  
##  Max.   :120.00       Max.   :6.000             Max.   :240.0  
##  Weed28D_thiobencarb  Urea_40Days     Potassh_50Days  Micronutrients_70Days
##  Min.   : 2.000      Min.   : 27.13   Min.   :10.38   Min.   :15.00        
##  1st Qu.: 6.000      1st Qu.: 81.39   1st Qu.:31.14   1st Qu.:45.00        
##  Median : 8.000      Median :108.52   Median :41.52   Median :60.00        
##  Mean   : 7.435      Mean   :100.85   Mean   :38.59   Mean   :55.76        
##  3rd Qu.:10.000      3rd Qu.:135.65   3rd Qu.:51.90   3rd Qu.:75.00        
##  Max.   :12.000      Max.   :162.78   Max.   :62.28   Max.   :90.00        
##  Pest_60Day(in ml) 30DRain( in mm)  30DAI(in mm)   30_50DRain( in mm)
##  Min.   : 600      Min.   :18.10   Min.   :20.40   Min.   :185.2     
##  1st Qu.:1800      1st Qu.:18.10   1st Qu.:20.40   1st Qu.:185.2     
##  Median :2400      Median :18.50   Median :21.50   Median :185.6     
##  Mean   :2230      Mean   :18.72   Mean   :21.28   Mean   :186.0     
##  3rd Qu.:3000      3rd Qu.:19.60   3rd Qu.:21.90   3rd Qu.:187.2     
##  Max.   :3600      Max.   :19.60   Max.   :21.90   Max.   :187.2     
##  30_50DAI(in mm) 51_70DRain(in mm) 51_70AI(in mm)  71_105DRain(in mm)
##  Min.   :270.8   Min.   :165.3     Min.   :250.0   Min.   :60.00     
##  1st Qu.:270.8   1st Qu.:165.3     1st Qu.:250.0   1st Qu.:60.00     
##  Median :272.4   Median :166.1     Median :250.9   Median :60.20     
##  Mean   :272.0   Mean   :166.2     Mean   :250.8   Mean   :60.41     
##  3rd Qu.:272.8   3rd Qu.:167.0     3rd Qu.:251.7   3rd Qu.:61.00     
##  Max.   :272.8   Max.   :167.0     Max.   :251.7   Max.   :61.00     
##  71_105DAI(in mm) Min temp_D1_D30 Max temp_D1_D30 Min temp_D31_D60
##  Min.   :64.00    Min.   :18.00   Min.   :31.00   Min.   :15.50   
##  1st Qu.:64.00    1st Qu.:18.50   1st Qu.:32.00   1st Qu.:16.00   
##  Median :64.80    Median :19.50   Median :33.00   Median :17.50   
##  Mean   :64.59    Mean   :19.34   Mean   :33.13   Mean   :17.14   
##  3rd Qu.:65.00    3rd Qu.:20.00   3rd Qu.:34.00   3rd Qu.:18.00   
##  Max.   :65.00    Max.   :20.50   Max.   :35.00   Max.   :18.50   
##  Max temp_D31_D60 Min temp_D61_D90 Max temp_D61_D90 Min temp_D91_D120
##  Min.   :28.00    Min.   :15.00    Min.   :31.00    Min.   :15.00    
##  1st Qu.:30.00    1st Qu.:15.50    1st Qu.:31.50    1st Qu.:15.50    
##  Median :30.00    Median :17.00    Median :33.00    Median :16.00    
##  Mean   :31.32    Mean   :16.68    Mean   :32.66    Mean   :16.19    
##  3rd Qu.:34.00    3rd Qu.:17.50    3rd Qu.:33.50    3rd Qu.:16.50    
##  Max.   :35.00    Max.   :18.00    Max.   :34.00    Max.   :18.00    
##  Max temp_D91_D120 Inst Wind Speed_D1_D30(in Knots)
##  Min.   :30.5      Min.   : 4.000                  
##  1st Qu.:31.5      1st Qu.: 4.000                  
##  Median :33.0      Median : 8.000                  
##  Mean   :32.7      Mean   : 7.233                  
##  3rd Qu.:33.0      3rd Qu.:10.000                  
##  Max.   :35.0      Max.   :10.000                  
##  Inst Wind Speed_D31_D60(in Knots) Inst Wind Speed_D61_D90(in Knots)
##  Min.   : 4.000                    Min.   : 4.000                   
##  1st Qu.: 6.000                    1st Qu.: 8.000                   
##  Median :10.000                    Median : 8.000                   
##  Mean   : 8.513                    Mean   : 8.173                   
##  3rd Qu.:12.000                    3rd Qu.:10.000                   
##  Max.   :12.000                    Max.   :10.000                   
##  Inst Wind Speed_D91_D120(in Knots) Wind Direction_D1_D30
##  Min.   : 6.000                     Length:2789          
##  1st Qu.: 6.000                     Class :character     
##  Median :10.000                     Mode  :character     
##  Mean   : 9.449                                          
##  3rd Qu.:12.000                                          
##  Max.   :12.000                                          
##  Wind Direction_D31_D60 Wind Direction_D61_D90 Wind Direction_D91_D120
##  Length:2789            Length:2789            Length:2789            
##  Class :character       Class :character       Class :character       
##  Mode  :character       Mode  :character       Mode  :character       
##                                                                       
##                                                                       
##                                                                       
##  Relative Humidity_D1_D30 Relative Humidity_D31_D60 Relative Humidity_D61_D90
##  Min.   :64.60            Min.   :78.00             Min.   :81.00            
##  1st Qu.:72.00            1st Qu.:80.00             1st Qu.:83.00            
##  Median :72.70            Median :91.00             Median :84.00            
##  Mean   :76.26            Mean   :87.59             Mean   :85.16            
##  3rd Qu.:85.00            3rd Qu.:95.00             3rd Qu.:88.00            
##  Max.   :88.50            Max.   :96.00             Max.   :92.00            
##  Relative Humidity_D91_D120 Trash(in bundles) Paddy yield(in Kg)
##  Min.   :79.00              Min.   : 80.0     Min.   : 5410     
##  1st Qu.:81.00              1st Qu.:240.0     1st Qu.:16389     
##  Median :84.00              Median :360.0     Median :24636     
##  Mean   :83.86              Mean   :335.5     Mean   :22518     
##  3rd Qu.:87.00              3rd Qu.:450.0     3rd Qu.:31035     
##  Max.   :88.00              Max.   :600.0     Max.   :38814
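
The column specification above reports 8 character and 37 numeric columns. PCA operates on numeric columns and MCA on categorical ones, so one plausible preparation step is to split the data frame by column type and convert the character columns to factors. The sketch below does this in base R on a small stand-in frame. The stand-in `toy` and the names `is_num`, `numeric_part`, and `categorical_part` are illustrative, and the values are the first four entries that str(df) prints for these five columns.

```r
# Stand-in for df: five columns, first four values as printed by str(df) above.
# check.names = FALSE keeps the original non-syntactic column names.
toy <- data.frame(
  Hectares          = c(6, 6, 6, 6),
  Agriblock         = c("Cuddalore", "Kurinjipadi", "Panruti", "Kallakurichi"),
  Variety           = c("CO_43", "ponmani", "delux ponni", "CO_43"),
  `Soil Types`      = c("alluvial", "clay", "alluvial", "clay"),
  `Seedrate(in Kg)` = c(150, 150, 150, 150),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Flag numeric columns, then split the frame by type.
is_num <- vapply(toy, is.numeric, logical(1))
numeric_part     <- toy[, is_num, drop = FALSE]
categorical_part <- toy[, !is_num, drop = FALSE]

# Convert every character column to a factor (as categorical methods expect).
categorical_part[] <- lapply(categorical_part, factor)

str(numeric_part)
str(categorical_part)
```

Column names that contain spaces or parentheses, such as `Soil Types`, need backticks when referenced directly, for example toy$`Soil Types`.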

Data Cleanup

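According to the dataset description, the FILENAME dimension is not required for analysis and can be dropped. The imported file has no FILENAME column (none of the 45 columns listed by str above carries that name), so the drop is left commented out. The first five rows are printed for reference.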

# df <- df[, !(names(df) %in% c("FILENAME"))]
print(head(df, 5))
## # A tibble: 5 × 45
##   Hectares Agriblock    Variety     `Soil Types` `Seedrate(in Kg)`
##      <dbl> <chr>        <chr>       <chr>                    <dbl>
## 1        6 Cuddalore    CO_43       alluvial                   150
## 2        6 Kurinjipadi  ponmani     clay                       150
## 3        6 Panruti      delux ponni alluvial                   150
## 4        6 Kallakurichi CO_43       clay                       150
## 5        6 Sankarapuram ponmani     alluvial                   150
## # ℹ 40 more variables: `LP_Mainfield(in Tonnes)` <dbl>, Nursery <chr>,
## #   `Nursery area (Cents)` <dbl>, `LP_nurseryarea(in Tonnes)` <dbl>,
## #   DAP_20days <dbl>, Weed28D_thiobencarb <dbl>, Urea_40Days <dbl>,
## #   Potassh_50Days <dbl>, Micronutrients_70Days <dbl>,
## #   `Pest_60Day(in ml)` <dbl>, `30DRain( in mm)` <dbl>, `30DAI(in mm)` <dbl>,
## #   `30_50DRain( in mm)` <dbl>, `30_50DAI(in mm)` <dbl>,
## #   `51_70DRain(in mm)` <dbl>, `51_70AI(in mm)` <dbl>, …
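
The commented-out drop uses a name-based filter, which is harmless when the column is absent. As a minimal sketch, the block below applies the same expression to a toy frame that does contain a FILENAME column and then applies it again once the column is gone. The toy frame, its FILENAME values, and the names `with_col`, `drop_cols`, `dropped`, and `no_op` are hypothetical.

```r
# Toy frame with a hypothetical FILENAME column (the imported data has none).
with_col <- data.frame(
  FILENAME = c("a.csv", "b.csv"),
  Hectares = c(6, 4),
  Variety  = c("CO_43", "ponmani"),
  stringsAsFactors = FALSE
)

drop_cols <- c("FILENAME")

# Same filter as the commented-out line; drop = FALSE guarantees a data
# frame is returned even if only one column remains.
dropped <- with_col[, !(names(with_col) %in% drop_cols), drop = FALSE]

# Applying the filter again, now that FILENAME is absent, keeps every column.
no_op <- dropped[, !(names(dropped) %in% drop_cols), drop = FALSE]

print(names(dropped))
print(identical(names(dropped), names(no_op)))
```

Filtering by name rather than by position means the step can stay in the script even if a later version of the file adds or removes the column.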