This RMarkdown file follows the six-step analysis process to investigate the impact of State and Irrigation practices on corn yield.
The dataset used in this analysis is sourced from the USDA National Agricultural Statistics Service (NASS) Quick Stats database. It contains corn yield data (in bushels per acre) for three states: Colorado (CO), Kansas (KS), and Nebraska (NE), categorized by irrigation status (Irrigated vs. Non-Irrigated).
The goal of this analysis is to understand how geographical location (State) and water management (Irrigation) jointly affect corn yields, including whether the effect of irrigation differs between states.
Research Questions:
Hypotheses:
First, we clean the data by filtering for Grain yield, converting values to numeric, and categorizing the irrigation status.
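# Load the packages used for cleaning
library(dplyr)
library(stringr)

# Load the dataset
raw_data <- read.csv("E970D53B-70E7-39B7-A8AA-007B2E9B3FFF.csv")

# Data cleaning and feature engineering
clean_data <- raw_data %>%
  # Strip thousands separators, then convert to numeric (non-numeric entries become NA)
  mutate(Value = as.numeric(gsub(",", "", Value))) %>%
  filter(!is.na(Value), str_detect(Data.Item, "GRAIN")) %>%
  # Rows not marked "NON-IRRIGATED" are labelled "Irrigated"
  mutate(Irrigation = ifelse(str_detect(Data.Item, "NON-IRRIGATED"), "Non-Irrigated", "Irrigated")) %>%
  select(State, Irrigation, Value)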
# Summary table
clean_data %>%
  group_by(State, Irrigation) %>%
  summarise(
    Mean_Yield = mean(Value),
    SD_Yield = sd(Value),
    Count = n()
  ) %>%
  knitr::kable(caption = "Descriptive Statistics of Corn Yield")

| State | Irrigation | Mean_Yield | SD_Yield | Count |
|---|---|---|---|---|
| COLORADO | Irrigated | 189.4294 | 9.562672 | 17 |
| COLORADO | Non-Irrigated | 55.8250 | 18.791300 | 12 |
| KANSAS | Irrigated | 190.0118 | 13.107149 | 17 |
| KANSAS | Non-Irrigated | 97.5000 | 22.388146 | 12 |
| NEBRASKA | Irrigated | 201.4059 | 12.195515 | 17 |
| NEBRASKA | Non-Irrigated | 132.1500 | 28.543093 | 12 |
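The labelling rules in the cleaning step can be sanity-checked on a few hand-written strings. The sketch below uses base R only; the `items` and `values` vectors are hypothetical examples written for illustration and are not rows from the dataset. It also shows one possible stricter variant (an assumption, not the rule used in this report) that leaves a row unlabelled when its description mentions neither irrigation status, instead of defaulting it to "Irrigated".

```r
# Hypothetical Data.Item strings and raw values, for illustration only
items <- c(
  "CORN, GRAIN, IRRIGATED - YIELD, MEASURED IN BU / ACRE",
  "CORN, GRAIN, NON-IRRIGATED - YIELD, MEASURED IN BU / ACRE",
  "CORN, GRAIN - YIELD, MEASURED IN BU / ACRE"
)
values <- c("1,234.5", "98.7", "(D)")

# Remove thousands separators; entries that are not numbers become NA
numeric_values <- suppressWarnings(as.numeric(gsub(",", "", values)))

# "NON-IRRIGATED" contains "IRRIGATED", so it has to be tested first
label <- ifelse(
  grepl("NON-IRRIGATED", items, fixed = TRUE), "Non-Irrigated",
  ifelse(grepl("IRRIGATED", items, fixed = TRUE), "Irrigated", NA_character_)
)

print(data.frame(items, numeric_values, label))
```

If the real file contains rows like the third one, the two-way `ifelse()` used in the cleaning step would count them as irrigated, so it is worth checking the distinct `Data.Item` values before relying on the group counts.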
The boxplot allows us to visualize the distribution and potential outliers across groups.
library(ggplot2)

ggplot(clean_data, aes(x = State, y = Value, fill = Irrigation)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Corn Yield Distribution by State and Irrigation Practice",
    y = "Yield (BU / ACRE)",
    x = "State"
  ) +
  theme_minimal()

To ensure our ANOVA results are valid, we check for normality and homogeneity of variance.
We use Levene’s test to check if the variance is equal across all groups.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 1.9939 0.08827 .
## 81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
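The chunk that produced the Levene output above is not shown. Its header matches what `car::leveneTest()` prints, so the call was probably of the form `leveneTest(Value ~ State * Irrigation, data = clean_data)`; that is an assumption. As a dependency-free sketch, the median-centered version of the test can be written in base R as a one-way ANOVA on each observation's absolute deviation from its group median. The `demo` data frame is a stand-in built from R's built-in `warpbreaks` data (3 x 2 factor layout, renamed to match this report) so that the code runs without the CSV; its values have nothing to do with corn yield.

```r
# Stand-in data with the same 3 x 2 layout as clean_data (NOT the corn data)
demo <- data.frame(
  Value      = warpbreaks$breaks,
  State      = warpbreaks$tension,
  Irrigation = warpbreaks$wool
)

# Median-centered Levene test: one-way ANOVA on |y - group median|
levene_median <- function(y, group) {
  group <- as.factor(group)
  abs_dev <- abs(y - ave(y, group, FUN = median))
  anova(lm(abs_dev ~ group))
}

# One group per State x Irrigation cell, as in the reported test (Df = 5)
cells <- interaction(demo$State, demo$Irrigation)
levene_result <- levene_median(demo$Value, cells)
print(levene_result)
```

To apply it to the report's data, replace `demo` with `clean_data`. In the output reported above, p = 0.088, so equal variances are not rejected at the 0.05 level, even though the non-irrigated standard deviations in the summary table are roughly twice the irrigated ones.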
We check if the residuals follow a normal distribution.
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.91871, p-value = 4.131e-05
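The model-fitting chunk is also hidden, but the Tukey output later in the report shows the formula `aov(Value ~ State * Irrigation, data = clean_data)`, and the Shapiro-Wilk header shows the test was run on `residuals(model)`. A minimal sketch of those two steps follows. It again uses the `warpbreaks`-based `demo` stand-in (an assumption made only so the code runs without the CSV), so the numbers it prints will not match the report.

```r
# Stand-in data with the same 3 x 2 layout as clean_data (NOT the corn data)
demo <- data.frame(
  Value      = warpbreaks$breaks,
  State      = warpbreaks$tension,
  Irrigation = warpbreaks$wool
)

# Two-way ANOVA with interaction, using the formula shown in the Tukey output
model <- aov(Value ~ State * Irrigation, data = demo)

# Shapiro-Wilk test on the model residuals
normality <- shapiro.test(residuals(model))
print(normality)
```

In the report's own output, p = 4.131e-05 is well below 0.05, so normality of the residuals is rejected. The ANOVA p-values that follow should be read with that caveat in mind.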
The ANOVA table provides the F-statistics and p-values for our primary factors and their interaction.
## Df Sum Sq Mean Sq F value Pr(>F)
## State 2 21665 10833 35.17 1.01e-11 ***
## Irrigation 1 204574 204574 664.12 < 2e-16 ***
## State:Irrigation 2 14937 7468 24.25 5.60e-09 ***
## Residuals 81 24951 308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
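As a quick check on how the table is built, each F value is the effect's mean square divided by the residual mean square, and each p-value is the upper tail of the F distribution with the matching degrees of freedom. The sketch below recomputes both from the rounded sums of squares printed above, so small rounding differences from the table are expected. The data frame and column names are our own.

```r
# Degrees of freedom and sums of squares copied from the ANOVA table above
anova_tab <- data.frame(
  term   = c("State", "Irrigation", "State:Irrigation"),
  df     = c(2, 1, 2),
  sum_sq = c(21665, 204574, 14937)
)
resid_df <- 81
resid_ss <- 24951

# Mean square = sum of squares / degrees of freedom
ms_resid <- resid_ss / resid_df
anova_tab$mean_sq <- anova_tab$sum_sq / anova_tab$df

# F = effect mean square / residual mean square
anova_tab$f_value <- anova_tab$mean_sq / ms_resid

# p-value = upper tail of F(df_effect, df_residual)
anova_tab$p_value <- pf(anova_tab$f_value, anova_tab$df, resid_df, lower.tail = FALSE)

print(anova_tab)
```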
Interpretation:
Because the interaction effect is significant, we perform a Tukey HSD test to compare specific group means.
# Interaction plot to visualize the effect
with(clean_data, interaction.plot(State, Irrigation, Value,
  main = "Interaction Effect of State and Irrigation",
  xlab = "State", ylab = "Mean Yield", col = c("blue", "red")
))

## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Value ~ State * Irrigation, data = clean_data)
##
## $State
## diff lwr upr p adj
## KANSAS-COLORADO 17.58621 6.581704 28.59071 0.000766
## NEBRASKA-COLORADO 38.60345 27.598945 49.60795 0.000000
## NEBRASKA-KANSAS 21.01724 10.012739 32.02174 0.000053
##
## $Irrigation
## diff lwr upr p adj
## Non-Irrigated-Irrigated -98.45735 -106.059 -90.85566 0
##
## $`State:Irrigation`
## diff lwr
## KANSAS:Irrigated-COLORADO:Irrigated 0.5823529 -16.990220
## NEBRASKA:Irrigated-COLORADO:Irrigated 11.9764706 -5.596103
## COLORADO:Non-Irrigated-COLORADO:Irrigated -133.6044118 -152.920925
## KANSAS:Non-Irrigated-COLORADO:Irrigated -91.9294118 -111.245925
## NEBRASKA:Non-Irrigated-COLORADO:Irrigated -57.2794118 -76.595925
## NEBRASKA:Irrigated-KANSAS:Irrigated 11.3941176 -6.178455
## COLORADO:Non-Irrigated-KANSAS:Irrigated -134.1867647 -153.503278
## KANSAS:Non-Irrigated-KANSAS:Irrigated -92.5117647 -111.828278
## NEBRASKA:Non-Irrigated-KANSAS:Irrigated -57.8617647 -77.178278
## COLORADO:Non-Irrigated-NEBRASKA:Irrigated -145.5808824 -164.897396
## KANSAS:Non-Irrigated-NEBRASKA:Irrigated -103.9058824 -123.222396
## NEBRASKA:Non-Irrigated-NEBRASKA:Irrigated -69.2558824 -88.572396
## KANSAS:Non-Irrigated-COLORADO:Non-Irrigated 41.6750000 20.759454
## NEBRASKA:Non-Irrigated-COLORADO:Non-Irrigated 76.3250000 55.409454
## NEBRASKA:Non-Irrigated-KANSAS:Non-Irrigated 34.6500000 13.734454
## upr p adj
## KANSAS:Irrigated-COLORADO:Irrigated 18.15493 0.9999988
## NEBRASKA:Irrigated-COLORADO:Irrigated 29.54904 0.3573570
## COLORADO:Non-Irrigated-COLORADO:Irrigated -114.28790 0.0000000
## KANSAS:Non-Irrigated-COLORADO:Irrigated -72.61290 0.0000000
## NEBRASKA:Non-Irrigated-COLORADO:Irrigated -37.96290 0.0000000
## NEBRASKA:Irrigated-KANSAS:Irrigated 28.96669 0.4140654
## COLORADO:Non-Irrigated-KANSAS:Irrigated -114.87025 0.0000000
## KANSAS:Non-Irrigated-KANSAS:Irrigated -73.19525 0.0000000
## NEBRASKA:Non-Irrigated-KANSAS:Irrigated -38.54525 0.0000000
## COLORADO:Non-Irrigated-NEBRASKA:Irrigated -126.26437 0.0000000
## KANSAS:Non-Irrigated-NEBRASKA:Irrigated -84.58937 0.0000000
## NEBRASKA:Non-Irrigated-NEBRASKA:Irrigated -49.93937 0.0000000
## KANSAS:Non-Irrigated-COLORADO:Non-Irrigated 62.59055 0.0000017
## NEBRASKA:Non-Irrigated-COLORADO:Non-Irrigated 97.24055 0.0000000
## NEBRASKA:Non-Irrigated-KANSAS:Non-Irrigated 55.56555 0.0000890
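The call that produced the Tukey output is not shown; given the `Fit:` line, it was presumably `TukeyHSD()` applied to the fitted `aov` model. The sketch below shows that call and one way to keep only the cell-level comparisons with an adjusted p-value below 0.05. It uses the `warpbreaks`-based `demo` stand-in (an assumption, so the code runs without the CSV), which means the printed comparisons are not the corn results.

```r
# Stand-in data with the same 3 x 2 layout as clean_data (NOT the corn data)
demo <- data.frame(
  Value      = warpbreaks$breaks,
  State      = warpbreaks$tension,
  Irrigation = warpbreaks$wool
)

model <- aov(Value ~ State * Irrigation, data = demo)

# Tukey HSD for both main effects and the interaction (95% family-wise level)
tukey <- TukeyHSD(model, conf.level = 0.95)

# All pairwise comparisons among the six State x Irrigation cells
cell_comparisons <- tukey[["State:Irrigation"]]

# Keep only comparisons with adjusted p < 0.05
significant <- cell_comparisons[cell_comparisons[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```

Six cells give 15 pairwise comparisons, which matches the 15 rows in the `State:Irrigation` block above.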
Plain Language Interpretation: Nebraska has the highest mean yields under both irrigated and non-irrigated conditions, although the differences among states under irrigation are not statistically significant (Tukey adjusted p > 0.35). The effect of irrigation is largest in Colorado, where it raises mean yield from 55.8 BU/ACRE to 189.4 BU/ACRE, a gain of about 134 BU/ACRE compared with about 93 in Kansas and 69 in Nebraska. This pattern is consistent with irrigation providing the greatest marginal benefit in the most arid of the three states.
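The per-state irrigation effect behind this interpretation can be recomputed directly from the group means in the descriptive statistics table. The sketch below copies those means by hand; the `Gain` and `Ratio` column names are our own.

```r
# Group means copied from the descriptive statistics table (BU / ACRE)
means <- data.frame(
  State         = c("COLORADO", "KANSAS", "NEBRASKA"),
  Irrigated     = c(189.4294, 190.0118, 201.4059),
  Non_Irrigated = c(55.8250, 97.5000, 132.1500)
)

# Absolute and relative effect of irrigation within each state
means$Gain  <- means$Irrigated - means$Non_Irrigated
means$Ratio <- means$Irrigated / means$Non_Irrigated

# Largest gain first
print(means[order(-means$Gain), ])
```

The gains agree with the within-state differences in the Tukey output (133.6, 92.5, and 69.3 BU/ACRE), and the unequal gains across states are what the significant State:Irrigation interaction reflects.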
Limitations:
Real-world Significance: