1 Abstract

This R Markdown report follows a six-step analysis process to investigate how state and irrigation practice affect corn yield.

2 Data Source

The dataset used in this analysis is sourced from the USDA National Agricultural Statistics Service (NASS) Quick Stats database. It contains corn yield data (in bushels per acre) for three states: Colorado (CO), Kansas (KS), and Nebraska (NE), categorized by irrigation status (Irrigated vs. Non-Irrigated).

3 Research Question and Hypotheses

The goal of this analysis is to understand how geographical location (State) and water management (Irrigation) jointly affect corn yields.

Research Questions:

  1. Does corn yield differ significantly across Colorado (CO), Kansas (KS), and Nebraska (NE)?
  2. Does irrigation significantly increase yield compared to non-irrigated practices?
  3. Is there an interaction effect between State and Irrigation status?

Hypotheses:

  • \(H_0\): State, Irrigation, and their interaction have no effect on mean yield.
  • \(H_1\): At least one of State, Irrigation, or their interaction has an effect on mean yield.
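
The hypotheses above can be written against the usual two-way ANOVA model. The notation below is an assumption added for illustration (it does not appear in the original analysis): \(i\) indexes State, \(j\) indexes Irrigation status, and \(k\) indexes observations within a State-by-Irrigation cell.

\[
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2)
\]

\[
H_0^{\text{State}}: \alpha_i = 0 \ \text{for all } i, \qquad
H_0^{\text{Irrigation}}: \beta_j = 0 \ \text{for all } j, \qquad
H_0^{\text{Interaction}}: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\]

Under this formulation each row of the ANOVA table tests one of the three null hypotheses, and the error term carries the normality and equal-variance assumptions that are checked before the ANOVA is interpreted.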

4 Data Loading and Descriptive Statistics

First, we clean the data by filtering for Grain yield, converting values to numeric, and categorizing the irrigation status.

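# Packages used throughout this report
library(dplyr)    # %>%, mutate, filter, select, group_by, summarise
library(stringr)  # str_detect
library(ggplot2)  # boxplot
library(car)      # leveneTest

# Load the dataset
raw_data <- read.csv("E970D53B-70E7-39B7-A8AA-007B2E9B3FFF.csv")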

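# Data cleaning and feature engineering
clean_data <- raw_data %>%
  # Strip thousands separators; non-numeric entries become NA (with a coercion warning)
  mutate(Value = as.numeric(gsub(",", "", Value))) %>%
  # Keep only rows with a numeric value whose Data.Item mentions GRAIN
  filter(!is.na(Value), str_detect(Data.Item, "GRAIN")) %>%
  # Anything not tagged NON-IRRIGATED is labeled Irrigated, so this assumes no untagged rows remain
  mutate(Irrigation = ifelse(str_detect(Data.Item, "NON-IRRIGATED"), "Non-Irrigated", "Irrigated")) %>%
  select(State, Irrigation, Value)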

# Summary table
clean_data %>%
  group_by(State, Irrigation) %>%
  summarise(
    Mean_Yield = mean(Value),
    SD_Yield = sd(Value),
    Count = n(),
    .groups = "drop"  # return an ungrouped table and silence the regrouping message
  ) %>%
  knitr::kable(caption = "Descriptive Statistics of Corn Yield")
Descriptive Statistics of Corn Yield

| State    | Irrigation    | Mean_Yield | SD_Yield  | Count |
|:---------|:--------------|-----------:|----------:|------:|
| COLORADO | Irrigated     | 189.4294   | 9.562672  | 17    |
| COLORADO | Non-Irrigated | 55.8250    | 18.791300 | 12    |
| KANSAS   | Irrigated     | 190.0118   | 13.107149 | 17    |
| KANSAS   | Non-Irrigated | 97.5000    | 22.388146 | 12    |
| NEBRASKA | Irrigated     | 201.4059   | 12.195515 | 17    |
| NEBRASKA | Non-Irrigated | 132.1500   | 28.543093 | 12    |
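
The cleaning step labels every row that lacks the NON-IRRIGATED tag as irrigated. The sketch below shows what that rule does on a few hypothetical rows; the `Data.Item` strings only imitate the Quick Stats format, the values are made up, and the column names `two_way` and `strict` are introduced here for illustration. It uses base R so that it runs without extra packages, and it compares the two-way rule with a stricter rule that leaves untagged rows as `NA`.

```r
# Hypothetical rows: strings imitate the Quick Stats format, values are made up.
toy <- data.frame(
  Data.Item = c(
    "CORN, GRAIN, IRRIGATED - YIELD, MEASURED IN BU / ACRE",
    "CORN, GRAIN, NON-IRRIGATED - YIELD, MEASURED IN BU / ACRE",
    "CORN, GRAIN - YIELD, MEASURED IN BU / ACRE",
    "CORN, SILAGE - YIELD, MEASURED IN TONS / ACRE"
  ),
  Value = c("1,200.5", "97.5", "150.0", "(D)"),
  stringsAsFactors = FALSE
)

# Same cleaning logic as the pipeline: drop commas, coerce, keep numeric GRAIN rows.
toy$Value <- suppressWarnings(as.numeric(gsub(",", "", toy$Value)))
toy <- toy[!is.na(toy$Value) & grepl("GRAIN", toy$Data.Item, fixed = TRUE), ]

# Two-way rule from the pipeline: anything not NON-IRRIGATED becomes "Irrigated".
toy$two_way <- ifelse(
  grepl("NON-IRRIGATED", toy$Data.Item, fixed = TRUE),
  "Non-Irrigated", "Irrigated"
)

# Stricter rule: require an explicit IRRIGATED tag, otherwise leave NA.
toy$strict <- ifelse(
  grepl("NON-IRRIGATED", toy$Data.Item, fixed = TRUE), "Non-Irrigated",
  ifelse(grepl(", IRRIGATED", toy$Data.Item, fixed = TRUE), "Irrigated", NA_character_)
)

print(toy[, c("Value", "two_way", "strict")])
```

If the real file contains untagged total rows, comparing `table(two_way)` with `table(strict, useNA = "ifany")` on the full data would show how many rows the two-way rule silently places in the irrigated group.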

4.1 Visualization: Boxplot

The boxplot allows us to visualize the distribution and potential outliers across groups.

ggplot(clean_data, aes(x = State, y = Value, fill = Irrigation)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Corn Yield Distribution by State and Irrigation Practice",
    y = "Yield (BU / ACRE)",
    x = "State"
  ) +
  theme_minimal()


5 Checking ANOVA Assumptions

To ensure our ANOVA results are valid, we check for normality and homogeneity of variance.

5.1 Homogeneity of Variance (Levene’s Test)

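We use Levene’s test to check whether the variance is equal across the six State × Irrigation groups. A p-value above 0.05 means equal variances are not rejected; here p = 0.088, so the assumption holds at the 5% level, although the margin is narrow.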

leveneTest(Value ~ State * Irrigation, data = clean_data)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  5  1.9939 0.08827 .
##       81                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
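
To make the mechanics of the test concrete: with `center = median`, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that calculation in base R on a small hypothetical dataset (the `toy` values are invented and are not the NASS yields).

```r
# Hypothetical toy data (NOT the NASS yields): three groups of five values.
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 5)),
  y = c(10, 12, 11, 13, 9,
        20, 25, 15, 30, 10,
        5, 6, 5, 7, 6)
)

# Step 1: absolute deviation of each value from its own group median.
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# Step 2: one-way ANOVA on those deviations gives the median-centered Levene test.
levene_tab <- anova(lm(abs_dev ~ group, data = toy))
levene_F <- levene_tab[["F value"]][1]
levene_p <- levene_tab[["Pr(>F)"]][1]

print(levene_tab)
```

Applied to `clean_data` with `interaction(State, Irrigation)` as the grouping factor, the same two steps should yield the 5 and 81 degrees of freedom shown in the output above.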

5.2 Normality of Residuals

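We check whether the model residuals follow a normal distribution using the Shapiro-Wilk test and a Q-Q plot. The test rejects normality (W = 0.919, p < 0.001), so the ANOVA p-values in the next section should be read with some caution.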

model <- aov(Value ~ State * Irrigation, data = clean_data)
shapiro.test(residuals(model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model)
## W = 0.91871, p-value = 4.131e-05
plot(model, which = 2) # Q-Q Plot


6 Two-way ANOVA Results

The ANOVA table provides the F-statistics and p-values for our primary factors and their interaction.

anova_summary <- summary(model)
print(anova_summary)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## State             2  21665   10833   35.17 1.01e-11 ***
## Irrigation        1 204574  204574  664.12  < 2e-16 ***
## State:Irrigation  2  14937    7468   24.25 5.60e-09 ***
## Residuals        81  24951     308                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

  • F-statistic: Represents the ratio of variance between groups to the variance within groups. A high F-value (especially for Irrigation) indicates the factor explains a large portion of the yield variation.
  • p-value: All three p-values are far below 0.05, so we reject the null hypothesis: State, Irrigation, and their interaction each have a statistically significant effect on yield.
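
To put a number on "a large portion of the yield variation", the sums of squares in the ANOVA table can be turned into effect sizes. The sketch below computes eta squared (share of the total sum of squares) and partial eta squared directly from the printed table; the variable names are introduced here for illustration. Because `aov()` uses sequential sums of squares and the cell sizes are unequal (17 vs. 12), these shares depend to some degree on the order of terms in the formula.

```r
# Sums of squares copied from the ANOVA table above.
ss <- c(State = 21665, Irrigation = 204574,
        `State:Irrigation` = 14937, Residuals = 24951)

# Eta squared: each term's share of the total sum of squares.
eta_sq <- ss / sum(ss)

# Partial eta squared: effect SS relative to effect SS plus residual SS.
effects <- ss[names(ss) != "Residuals"]
partial_eta_sq <- effects / (effects + ss[["Residuals"]])

print(round(eta_sq, 3))
print(round(partial_eta_sq, 3))
```

Irrigation accounts for roughly 77% of the total sum of squares, while State and the interaction account for roughly 8% and 6%.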

7 Post-hoc Testing and Interpretation

Because the interaction effect is significant, we perform a Tukey HSD test to compare specific group means.

# Interaction plot to visualize the effect
with(clean_data, interaction.plot(State, Irrigation, Value,
  main = "Interaction Effect of State and Irrigation",
  xlab = "State", ylab = "Mean Yield", col = c("blue", "red")
))

# Tukey HSD for pairwise comparisons
TukeyHSD(model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Value ~ State * Irrigation, data = clean_data)
## 
## $State
##                       diff       lwr      upr    p adj
## KANSAS-COLORADO   17.58621  6.581704 28.59071 0.000766
## NEBRASKA-COLORADO 38.60345 27.598945 49.60795 0.000000
## NEBRASKA-KANSAS   21.01724 10.012739 32.02174 0.000053
## 
## $Irrigation
##                              diff      lwr       upr p adj
## Non-Irrigated-Irrigated -98.45735 -106.059 -90.85566     0
## 
## $`State:Irrigation`
##                                                       diff         lwr
## KANSAS:Irrigated-COLORADO:Irrigated              0.5823529  -16.990220
## NEBRASKA:Irrigated-COLORADO:Irrigated           11.9764706   -5.596103
## COLORADO:Non-Irrigated-COLORADO:Irrigated     -133.6044118 -152.920925
## KANSAS:Non-Irrigated-COLORADO:Irrigated        -91.9294118 -111.245925
## NEBRASKA:Non-Irrigated-COLORADO:Irrigated      -57.2794118  -76.595925
## NEBRASKA:Irrigated-KANSAS:Irrigated             11.3941176   -6.178455
## COLORADO:Non-Irrigated-KANSAS:Irrigated       -134.1867647 -153.503278
## KANSAS:Non-Irrigated-KANSAS:Irrigated          -92.5117647 -111.828278
## NEBRASKA:Non-Irrigated-KANSAS:Irrigated        -57.8617647  -77.178278
## COLORADO:Non-Irrigated-NEBRASKA:Irrigated     -145.5808824 -164.897396
## KANSAS:Non-Irrigated-NEBRASKA:Irrigated       -103.9058824 -123.222396
## NEBRASKA:Non-Irrigated-NEBRASKA:Irrigated      -69.2558824  -88.572396
## KANSAS:Non-Irrigated-COLORADO:Non-Irrigated     41.6750000   20.759454
## NEBRASKA:Non-Irrigated-COLORADO:Non-Irrigated   76.3250000   55.409454
## NEBRASKA:Non-Irrigated-KANSAS:Non-Irrigated     34.6500000   13.734454
##                                                      upr     p adj
## KANSAS:Irrigated-COLORADO:Irrigated             18.15493 0.9999988
## NEBRASKA:Irrigated-COLORADO:Irrigated           29.54904 0.3573570
## COLORADO:Non-Irrigated-COLORADO:Irrigated     -114.28790 0.0000000
## KANSAS:Non-Irrigated-COLORADO:Irrigated        -72.61290 0.0000000
## NEBRASKA:Non-Irrigated-COLORADO:Irrigated      -37.96290 0.0000000
## NEBRASKA:Irrigated-KANSAS:Irrigated             28.96669 0.4140654
## COLORADO:Non-Irrigated-KANSAS:Irrigated       -114.87025 0.0000000
## KANSAS:Non-Irrigated-KANSAS:Irrigated          -73.19525 0.0000000
## NEBRASKA:Non-Irrigated-KANSAS:Irrigated        -38.54525 0.0000000
## COLORADO:Non-Irrigated-NEBRASKA:Irrigated     -126.26437 0.0000000
## KANSAS:Non-Irrigated-NEBRASKA:Irrigated        -84.58937 0.0000000
## NEBRASKA:Non-Irrigated-NEBRASKA:Irrigated      -49.93937 0.0000000
## KANSAS:Non-Irrigated-COLORADO:Non-Irrigated     62.59055 0.0000017
## NEBRASKA:Non-Irrigated-COLORADO:Non-Irrigated   97.24055 0.0000000
## NEBRASKA:Non-Irrigated-KANSAS:Non-Irrigated     55.56555 0.0000890
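
The interval widths in the State:Irrigation block follow from the Tukey-Kramer formula: half-width = q / sqrt(2) × sqrt(MSE × (1/n1 + 1/n2)), where q is the studentized-range quantile for six cell means and 81 residual degrees of freedom. The sketch below recomputes the half-widths from the residual sum of squares and the cell counts reported earlier; the helper name `tukey_half_width` is introduced here for illustration.

```r
# Residual mean square and degrees of freedom from the ANOVA table.
mse <- 24951 / 81
df_resid <- 81
n_cells <- 6  # 3 states x 2 irrigation levels

# Studentized-range critical value for a 95% family-wise level.
q_crit <- qtukey(0.95, nmeans = n_cells, df = df_resid)

# Tukey-Kramer half-width for comparing two cells of sizes n1 and n2.
tukey_half_width <- function(n1, n2) {
  (q_crit / sqrt(2)) * sqrt(mse * (1 / n1 + 1 / n2))
}

hw_irr   <- tukey_half_width(17, 17)  # irrigated vs. irrigated
hw_non   <- tukey_half_width(12, 12)  # non-irrigated vs. non-irrigated
hw_mixed <- tukey_half_width(17, 12)  # irrigated vs. non-irrigated

print(round(c(hw_irr, hw_non, hw_mixed), 2))
```

Up to rounding of the sum of squares, these should match the half-widths implied by the output above (about 17.6, 20.9, and 19.3 BU/ACRE). This is why the irrigated-only comparisons, with differences of at most about 12 BU/ACRE, are not significant.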

Plain Language Interpretation: Nebraska has the highest mean yield under both irrigated and non-irrigated conditions, although the differences among the three irrigated groups are not statistically significant (adjusted p > 0.35). The impact of irrigation is largest in Colorado, where it raises average yield from 55.8 BU/ACRE to 189.4 BU/ACRE, compared with an increase from 132.2 to 201.4 BU/ACRE in Nebraska. This pattern is consistent with irrigation providing the highest marginal benefit in the most arid environments.
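
The interaction described above can be read directly off the cell means. The sketch below recomputes the irrigation gain for each state, as a difference and as a ratio, from the means in the descriptive statistics table; the column names `Gain` and `Ratio` are introduced here for illustration.

```r
# Cell means copied from the descriptive statistics table (BU/ACRE).
cell_means <- data.frame(
  State = c("COLORADO", "KANSAS", "NEBRASKA"),
  Irrigated = c(189.4294, 190.0118, 201.4059),
  NonIrrigated = c(55.8250, 97.5000, 132.1500)
)

# Absolute and relative benefit of irrigation within each state.
cell_means$Gain <- cell_means$Irrigated - cell_means$NonIrrigated
cell_means$Ratio <- cell_means$Irrigated / cell_means$NonIrrigated

print(cell_means)
```

The gains are about 134, 93, and 69 BU/ACRE for Colorado, Kansas, and Nebraska, matching the within-state differences in the Tukey output. The unequal gains are what the significant State:Irrigation term reflects.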


8 Limitations and Significance

Limitations:

  • External Factors: This analysis does not account for soil quality or corn hybrid, both of which may vary by state.
  • Forecast Data: Some rows include “Forecast” values, which may differ slightly from the final actual harvested data.

Real-world Significance:

  1. Water Policy: In water-scarce years, Colorado’s agricultural sector is at much higher risk than Nebraska’s if irrigation is restricted.
  2. Resilience: Nebraska’s high non-irrigated yield (132.2 BU/ACRE) suggests better natural precipitation or more advanced dryland farming techniques that could be studied and exported to other states.