Growing Through Change: The Earth’s Farms Face Rising Heat

26 Years of Change (1998-2024)

1. Hemanth Rangaswamy (s4069811), 2. Ananya Solanki (s4019236), 3. Kiran Kumar Reddy Alleddula (s4086075)

Last updated: 19 October, 2024

Overview

Our work explores the impact of climate change on farming, specifically how temperature, rainfall, and CO2 emissions influence crop yields. This is an essential issue today, as farmers around the world face changing climate conditions that affect food production. To guide our work, we relied on several modules from our course, particularly Module 1, which helped us understand how to define our variables and structure our assignment.

Why This Matters

Climate change is a critical issue, and its impact on farming cannot be ignored. By examining this data, we hope to reveal insights that can help farmers adapt to changing weather patterns. Finding a suitable dataset was one of our early challenges, but after some research we found one that provides a global view of farming.

Exploring Climate Effects on Agriculture:

Dataset Overview

Where the Data Came From:

We used a dataset titled climate_change_impact_on_agriculture_2024.csv, which includes information on temperature, precipitation, CO2 emissions, crop yields, and economic impacts across different regions. While we didn’t collect this data ourselves, it seemed comprehensive enough to fit the needs of our analysis.

Sampling and Reliability:

Our understanding of the key variables:

Dataset Preparation

Before we could start analyzing, we had to clean up our data and make sure it was in good shape. Here’s what we did:

What’s in Our Data?

We used a file called “climate_change_impact_on_agriculture_2024.csv”. It’s like a big spreadsheet with lots of information about how climate change is affecting farming around the world from 1998 to 2024.

It includes things like:

Sometimes there were missing pieces of information. We couldn’t just leave them empty, so:

Source link: https://www.kaggle.com/datasets/waqi786/climate-change-impact-on-agriculture

# Loading necessary libraries for our tasks. 
library(dplyr) # For data manipulation
library(stringr) # For string handling tasks
library(ggplot2) # For data visualization
library(corrplot) # For correlation matrix plotting

Loading the dataset, then displaying its structure and first few rows. (Update the file path to match your machine.)

climate_data <- read.csv("C:/Users/Hemanth Gowda/Downloads/archive (2)/climate_change_impact_on_agriculture_2024.csv")

str(climate_data)  # To display structure of the dataset
## 'data.frame':    10000 obs. of  15 variables:
##  $ Year                       : int  2001 2024 2001 2001 1998 2019 1997 2021 2012 2018 ...
##  $ Country                    : chr  "India" "China" "France" "Canada" ...
##  $ Region                     : chr  "West Bengal" "North" "Ile-de-France" "Prairies" ...
##  $ Crop_Type                  : chr  "Corn" "Corn" "Wheat" "Coffee" ...
##  $ Average_Temperature_C      : num  1.55 3.23 21.11 27.85 2.19 ...
##  $ Total_Precipitation_mm     : num  447 2914 1302 1154 1627 ...
##  $ CO2_Emissions_MT           : num  15.2 29.8 25.8 13.9 11.8 ...
##  $ Crop_Yield_MT_per_HA       : num  1.74 1.74 1.72 3.89 1.08 ...
##  $ Extreme_Weather_Events     : int  8 8 5 5 9 5 2 4 1 1 ...
##  $ Irrigation_Access_.        : num  14.5 11.1 84.4 94.1 95.8 ...
##  $ Pesticide_Use_KG_per_HA    : num  10.1 33.1 27.4 14.4 44.4 ...
##  $ Fertilizer_Use_KG_per_HA   : num  14.8 23.2 65.5 87.6 88.1 ...
##  $ Soil_Health_Index          : num  83.2 54 67.8 91.4 49.6 ...
##  $ Adaptation_Strategies      : chr  "Water Management" "Crop Rotation" "Water Management" "No Adaptation" ...
##  $ Economic_Impact_Million_USD: num  808 616 797 790 402 ...
head(climate_data) # To show the first few records
print("Column names before renaming:")
## [1] "Column names before renaming:"
print(colnames(climate_data))
##  [1] "Year"                        "Country"                    
##  [3] "Region"                      "Crop_Type"                  
##  [5] "Average_Temperature_C"       "Total_Precipitation_mm"     
##  [7] "CO2_Emissions_MT"            "Crop_Yield_MT_per_HA"       
##  [9] "Extreme_Weather_Events"      "Irrigation_Access_."        
## [11] "Pesticide_Use_KG_per_HA"     "Fertilizer_Use_KG_per_HA"   
## [13] "Soil_Health_Index"           "Adaptation_Strategies"      
## [15] "Economic_Impact_Million_USD"
# Renaming specific columns
colnames(climate_data)[colnames(climate_data) == "Soil_Health_Index"] <- "Soil_Quality_Index"
colnames(climate_data)[colnames(climate_data) == "Year"] <- "Recording_Year"

Printing the column names after renaming

print("Column names after renaming:")
## [1] "Column names after renaming:"
print(colnames(climate_data))
##  [1] "Recording_Year"              "Country"                    
##  [3] "Region"                      "Crop_Type"                  
##  [5] "Average_Temperature_C"       "Total_Precipitation_mm"     
##  [7] "CO2_Emissions_MT"            "Crop_Yield_MT_per_HA"       
##  [9] "Extreme_Weather_Events"      "Irrigation_Access_."        
## [11] "Pesticide_Use_KG_per_HA"     "Fertilizer_Use_KG_per_HA"   
## [13] "Soil_Quality_Index"          "Adaptation_Strategies"      
## [15] "Economic_Impact_Million_USD"
# Counting the missing values to check for incomplete entries
climate_data$missing_count <- rowSums(is.na(climate_data))
missing_count <- colSums(is.na(climate_data))
print(missing_count)
##              Recording_Year                     Country 
##                           0                           0 
##                      Region                   Crop_Type 
##                           0                           0 
##       Average_Temperature_C      Total_Precipitation_mm 
##                           0                           0 
##            CO2_Emissions_MT        Crop_Yield_MT_per_HA 
##                           0                           0 
##      Extreme_Weather_Events         Irrigation_Access_. 
##                           0                           0 
##     Pesticide_Use_KG_per_HA    Fertilizer_Use_KG_per_HA 
##                           0                           0 
##          Soil_Quality_Index       Adaptation_Strategies 
##                           0                           0 
## Economic_Impact_Million_USD               missing_count 
##                           0                           0
# Checking each column for infinite or NaN values
special_values_check <- sapply(climate_data, function(x) sum(is.infinite(x) | is.nan(x)))
print(special_values_check)
##              Recording_Year                     Country 
##                           0                           0 
##                      Region                   Crop_Type 
##                           0                           0 
##       Average_Temperature_C      Total_Precipitation_mm 
##                           0                           0 
##            CO2_Emissions_MT        Crop_Yield_MT_per_HA 
##                           0                           0 
##      Extreme_Weather_Events         Irrigation_Access_. 
##                           0                           0 
##     Pesticide_Use_KG_per_HA    Fertilizer_Use_KG_per_HA 
##                           0                           0 
##          Soil_Quality_Index       Adaptation_Strategies 
##                           0                           0 
## Economic_Impact_Million_USD               missing_count 
##                           0                           0
str(climate_data)
## 'data.frame':    10000 obs. of  16 variables:
##  $ Recording_Year             : int  2001 2024 2001 2001 1998 2019 1997 2021 2012 2018 ...
##  $ Country                    : chr  "India" "China" "France" "Canada" ...
##  $ Region                     : chr  "West Bengal" "North" "Ile-de-France" "Prairies" ...
##  $ Crop_Type                  : chr  "Corn" "Corn" "Wheat" "Coffee" ...
##  $ Average_Temperature_C      : num  1.55 3.23 21.11 27.85 2.19 ...
##  $ Total_Precipitation_mm     : num  447 2914 1302 1154 1627 ...
##  $ CO2_Emissions_MT           : num  15.2 29.8 25.8 13.9 11.8 ...
##  $ Crop_Yield_MT_per_HA       : num  1.74 1.74 1.72 3.89 1.08 ...
##  $ Extreme_Weather_Events     : int  8 8 5 5 9 5 2 4 1 1 ...
##  $ Irrigation_Access_.        : num  14.5 11.1 84.4 94.1 95.8 ...
##  $ Pesticide_Use_KG_per_HA    : num  10.1 33.1 27.4 14.4 44.4 ...
##  $ Fertilizer_Use_KG_per_HA   : num  14.8 23.2 65.5 87.6 88.1 ...
##  $ Soil_Quality_Index         : num  83.2 54 67.8 91.4 49.6 ...
##  $ Adaptation_Strategies      : chr  "Water Management" "Crop Rotation" "Water Management" "No Adaptation" ...
##  $ Economic_Impact_Million_USD: num  808 616 797 790 402 ...
##  $ missing_count              : num  0 0 0 0 0 0 0 0 0 0 ...
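The two checks above can also be combined into one per-column report. The following is a minimal sketch, not the code we ran: `quality_report` is a hypothetical helper name, and the small `toy` data frame holds made-up values standing in for `climate_data`. It counts missing values (NaN is counted as missing by `is.na()`) and infinite values, and it only applies `is.infinite()` to numeric columns.

```r
# Hypothetical toy data standing in for climate_data (values are made up)
toy <- data.frame(
  Country = c("India", "China", NA, "France"),
  Average_Temperature_C = c(1.55, NA, 21.11, Inf),
  Crop_Yield_MT_per_HA = c(1.74, 1.74, NaN, 3.89),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per column with counts of missing and infinite values
quality_report <- function(df) {
  data.frame(
    column = names(df),
    # is.na() is TRUE for both NA and NaN
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
    # is.infinite() is only meaningful for numeric columns
    n_infinite = vapply(
      df,
      function(x) if (is.numeric(x)) sum(is.infinite(x)) else 0L,
      integer(1),
      USE.NAMES = FALSE
    )
  )
}

report <- quality_report(toy)
print(report)
```

Unlike adding a `missing_count` column to the data frame, this keeps the report separate, so the dataset itself stays at its original 15 variables.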

Breaking Down the Numbers with Visuals

In this section, we’ll look into how we approached descriptive statistics and visualizations to summarize and explore the relationships between the key climate and agricultural variables in the dataset. Throughout this stage, we referred to multiple module notes to guide our analysis and ensure the accuracy of our results.

We picked Average Temperature, Total Precipitation, CO2 Emissions, Crop Yield, and Economic Impact because they directly show how climate change affects farming. Crop yield will tell us how much production has happened, and the economic impact will show us the financial ups and downs caused by climate changes. Together, these factors will help us understand how climate affects both the land and the economy.

Economic Impact (Million USD): Represents the financial losses or gains caused by climate impacts on agriculture.

In Module 1, we learnt the importance of identifying key variables before performing any deeper analysis. We computed descriptive statistics (mean, median, and standard deviation) for temperature, precipitation, CO2 emissions, and crop yield to summarize their central tendency and spread.

summary_data <- climate_data %>%
  summarise(
    # Average temperature in degrees Celsius
    mean_temp = mean(Average_Temperature_C, na.rm = TRUE),
    median_temp = median(Average_Temperature_C, na.rm = TRUE),
    sd_temp = sd(Average_Temperature_C, na.rm = TRUE),
    # Total precipitation in millimeters
    mean_precip = mean(Total_Precipitation_mm, na.rm = TRUE),
    median_precip = median(Total_Precipitation_mm, na.rm = TRUE),
    sd_precip = sd(Total_Precipitation_mm, na.rm = TRUE),
    # CO2 emissions in metric tons
    mean_co2 = mean(CO2_Emissions_MT, na.rm = TRUE),
    median_co2 = median(CO2_Emissions_MT, na.rm = TRUE),
    sd_co2 = sd(CO2_Emissions_MT, na.rm = TRUE),
    # Crop yield in metric tons per hectare
    mean_yield = mean(Crop_Yield_MT_per_HA, na.rm = TRUE),
    median_yield = median(Crop_Yield_MT_per_HA, na.rm = TRUE),
    sd_yield = sd(Crop_Yield_MT_per_HA, na.rm = TRUE)
  )
# Display the calculated summary statistics
print(summary_data)
##   mean_temp median_temp  sd_temp mean_precip median_precip sd_precip mean_co2
## 1   15.2413      15.175 11.46695    1611.664       1611.16  805.0168 15.24661
##   median_co2   sd_co2 mean_yield median_yield  sd_yield
## 1       15.2 8.589423   2.240017         2.17 0.9983415
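The same three statistics can be produced with less repetition by applying one function to every column of interest. This is a sketch under stated assumptions: the `toy` data frame is simulated (it is not our dataset), and `summarise_var` is a hypothetical helper name. The result has one row per variable and one column per statistic, which is easier to read than a single wide row.

```r
# Simulated stand-in data; the column names mirror our dataset but the values are random
set.seed(1)
toy <- data.frame(
  Average_Temperature_C = runif(50, -5, 35),
  Total_Precipitation_mm = runif(50, 200, 3000),
  Crop_Yield_MT_per_HA = runif(50, 0.5, 5)
)

# Hypothetical helper returning mean, median and standard deviation of one vector
summarise_var <- function(x) {
  c(
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE)
  )
}

# sapply() returns statistics in rows, so transpose to get one row per variable
summary_table <- t(sapply(toy, summarise_var))
print(round(summary_table, 3))
```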

Our understanding from the above summary statistics

Visual Insights

In Module 2, we learned the significance of choosing the right visualizations to tell a data-driven story. This module helped us select the most effective plots to highlight the key relationships in our dataset.

Visualizing the Impact of Adaptation Strategies with a Bar Chart

# Average number of extreme weather events per adaptation strategy
ggplot(climate_data %>% 
         group_by(Adaptation_Strategies) %>% 
         summarise(mean_events = mean(Extreme_Weather_Events, na.rm = TRUE)), 
       aes(x = Adaptation_Strategies, y = mean_events, fill = Adaptation_Strategies)) +
  geom_col() +
  labs(title = "Impact by Adaptation Strategies", x = "Adaptation Strategies", y = "Mean Extreme Weather Events") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  ylim(0, 10) +
  scale_fill_manual(values = c(
    "Crop Rotation" = "mediumseagreen",   
    "Drought-resistant Crops" = "brown", 
    "No Adaptation" = "black",        
    "Organic Farming" = "yellowgreen",         
    "Water Management" = "lightblue"    
  ))

The bar chart compares extreme weather events across adaptation strategies. In our plot, “No Adaptation” is associated with fewer events than strategies such as crop rotation or water management. This is an association in the data, not evidence that a strategy changes how often extreme weather occurs.

Visualizing the Relationship Between Crop Yield and Pesticide Use

# Note: only the first 10 records are plotted
ggplot(climate_data %>% head(10), aes(x = Pesticide_Use_KG_per_HA, y = Crop_Yield_MT_per_HA)) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Crop Yield vs. Pesticide Use", x = "Pesticide Use (KG per HA)", y = "Crop Yield (MT per Hectare)") +
  theme_minimal()

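The scatter plot of pesticide use against crop yield, drawn from only the first 10 records, suggests that using more pesticides initially boosts yields but that too much reduces productivity, meaning balance is key. With so few points, this is a tentative pattern rather than a firm conclusion.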

Visualizing the Average Economic Impact by Region in a Bar Chart

ggplot(climate_data %>% 
         group_by(Region) %>% 
         summarise(mean_economic_impact = mean(Economic_Impact_Million_USD, na.rm = TRUE)), 
       aes(x = Region, y = mean_economic_impact)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "Average Economic Impact by Region", x = "Region", y = "Economic Impact (Million USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The economic impact bar chart shows us that most regions face similar effects from climate change, with some regions like North Central and Pampas seeing slightly higher impacts.

Testing the Hypothesis: Comparing Crop Yields in India and China with Confidence Intervals

We conducted hypothesis testing to examine whether there is a significant difference in corn yields between India and China.

The null hypothesis states that the mean crop yields in India and China are equal.

Null Hypothesis (H₀): \[H_0: \mu_{India} = \mu_{China}\]

The alternative hypothesis states that the mean crop yields in India and China are not equal, indicating a potential difference between the two countries’ crop yields.

Alternative (H₁): \[H_1: \mu_{India} \neq \mu_{China}\]

These hypotheses were tested using a two-sample t-test to determine if the observed difference in the means of crop yields is statistically significant. Before applying the t-test, we used the Shapiro-Wilk test to check whether the crop yield data follows a normal distribution, which is an assumption of the t-test.
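Because the t-test compares two groups, one option is to check normality within each country rather than on the pooled sample. The sketch below illustrates that alternative on simulated data; the `toy` data frame and its values are made up, and this is not the check we ran on our dataset. `shapiro.test()` is the base R function from the stats package.

```r
# Simulated stand-in for the corn subset (values are random, not from our dataset)
set.seed(5)
toy <- data.frame(
  Country = rep(c("India", "China"), each = 60),
  Crop_Yield_MT_per_HA = c(rnorm(60, mean = 2.1, sd = 1), rnorm(60, mean = 2.1, sd = 1))
)

# Run the Shapiro-Wilk test separately for each country and keep the p-values
p_values <- tapply(
  toy$Crop_Yield_MT_per_HA,
  toy$Country,
  function(x) shapiro.test(x)$p.value
)
print(p_values)
```

A small p-value in either group would suggest that the yields in that group depart from a normal distribution.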

Two-Sample t-test

We used a two-sample t-test to compare the means of crop yields between India and China. The equation we used for the t-test is: \[t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\] Where: \(\bar{X}_1\) and \(\bar{X}_2\) are the sample means for India and China. \(s_1^2\) and \(s_2^2\) are the sample variances for India and China. \(n_1\) and \(n_2\) are the sample sizes for India and China.

We calculated a 95% confidence interval to estimate the range of values within which the true difference in crop yields lies. The equation for the confidence interval is: \[CI = (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2} \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

Where: \(\bar{X}_1 - \bar{X}_2\) is the difference in the sample means. \(t_{\alpha/2}\) is the critical value from the t-distribution for a 95% confidence level.
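To make the two formulas above concrete, the sketch below computes the t-statistic and the 95% confidence interval by hand and compares them with R's built-in `t.test()`. The data are simulated, and `welch_manual` is a hypothetical helper name. One detail not stated in the formulas is the degrees of freedom used for the critical value: `t.test()` defaults to the Welch version of the test, which uses the Welch-Satterthwaite approximation, so the sketch does the same.

```r
# Simulated yields for two groups (random values, not our dataset)
set.seed(42)
india <- rnorm(90, mean = 2.1, sd = 1.0)
china <- rnorm(98, mean = 2.1, sd = 1.0)

# Hypothetical helper implementing the t-statistic and confidence interval formulas
welch_manual <- function(x1, x2, conf_level = 0.95) {
  n1 <- length(x1)
  n2 <- length(x2)
  v1 <- var(x1) / n1                      # s1^2 / n1
  v2 <- var(x2) / n2                      # s2^2 / n2
  se <- sqrt(v1 + v2)                     # standard error of the difference
  mean_diff <- mean(x1) - mean(x2)
  t_stat <- mean_diff / se
  # Welch-Satterthwaite degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  t_crit <- qt(1 - (1 - conf_level) / 2, df)
  list(
    t = t_stat,
    df = df,
    p_value = 2 * pt(-abs(t_stat), df),   # two-sided p-value
    conf_int = mean_diff + c(-1, 1) * t_crit * se
  )
}

manual <- welch_manual(china, india)
builtin <- t.test(china, india)           # var.equal = FALSE (Welch) is the default

print(manual)
print(builtin)
```

The hand-computed statistic, degrees of freedom, p-value, and interval should agree with the values reported by `t.test()`.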

# Filtering dataset for Corn yields in India and China
corn_data <- climate_data %>% filter(Crop_Type == "Corn" & Country %in% c("India", "China"))
# Checking normality of crop yield with the Shapiro-Wilk test
shapiro.test(corn_data$Crop_Yield_MT_per_HA)
## 
##  Shapiro-Wilk normality test
## 
## data:  corn_data$Crop_Yield_MT_per_HA
## W = 0.96426, p-value = 0.0001014
# Summary statistics for corn yields in India and China
corn_summary <- corn_data %>%
  group_by(Country) %>%
  summarise(
    mean_yield = mean(Crop_Yield_MT_per_HA, na.rm = TRUE),
    sd_yield = sd(Crop_Yield_MT_per_HA, na.rm = TRUE),
    count = n()
  )

Showing Summary Statistics

# Performing a t-test to compare yields between India and China
yield_t_test <- t.test(Crop_Yield_MT_per_HA ~ Country, data = corn_data)
# Displaying the t-test results
print(yield_t_test)
## 
##  Welch Two Sample t-test
## 
## data:  Crop_Yield_MT_per_HA by Country
## t = 0.15795, df = 183.16, p-value = 0.8747
## alternative hypothesis: true difference in means between group China and group India is not equal to 0
## 95 percent confidence interval:
##  -0.2655497  0.3117677
## sample estimates:
## mean in group China mean in group India 
##            2.131677            2.108568
# Extracting and printing the 95% confidence interval for the difference in yields
print(yield_t_test$conf.int)
## [1] -0.2655497  0.3117677
## attr(,"conf.level")
## [1] 0.95

With a p-value of 0.8747, well above the 0.05 significance level, we fail to reject the null hypothesis: the data provide no evidence of a difference in mean corn yield between China (2.13 MT per hectare) and India (2.11 MT per hectare). The 95% confidence interval for the difference, (-0.27, 0.31), contains zero, which is consistent with this conclusion. The Shapiro-Wilk test (p = 0.0001) indicates that the pooled yields depart from normality, so this result relies on the t-test being reasonably robust at our sample size.

Analyzing How Climate Factors Affect Crop Yields with Regression

We applied multiple linear regression to model the relationship between crop yield (the dependent variable) and two independent variables: average temperature and total precipitation. This allowed us to estimate how much crop yields change in response to variations in these climate factors. In Module 9, we learned how to use multiple linear regression, including how to select independent variables and interpret the regression output. This understanding guided the construction of our model.

Multiple Linear Regression

The regression model to predict crop yield from two independent variables (average temperature and total precipitation) is given by the equation: \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]

Where:

- \(Y\) represents the crop yield (dependent variable),
- \(X_1\) is the average temperature,
- \(X_2\) is the total precipitation,
- \(\beta_0\) is the intercept (the predicted value of crop yield when all independent variables are zero),
- \(\beta_1\) and \(\beta_2\) are the regression coefficients for temperature and precipitation, respectively,
- \(\epsilon\) is the error term, representing the variability in crop yields not explained by the independent variables.

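Coefficient Interpretation

Each regression coefficient \(\beta_i\) tells us how much the dependent variable (crop yield) changes for a one-unit increase in the corresponding independent variable, holding all other variables constant. When the model contains only one predictor \(X_i\), the slope estimate is: \[ \hat{\beta}_i = \frac{\text{Cov}(X_i, Y)}{\text{Var}(X_i)} \] Where: \(\text{Cov}(X_i, Y)\) is the covariance between the independent variable \(X_i\) and the dependent variable \(Y\), \(\text{Var}(X_i)\) is the variance of the independent variable \(X_i\). With more than one predictor, as in our model, this ratio gives the coefficient exactly only if the predictors are uncorrelated; in general the coefficients are estimated jointly by least squares: \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \] Where: \(\mathbf{X}\) is the design matrix (a column of ones plus one column per independent variable) and \(\mathbf{y}\) is the vector of observed crop yields.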
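The covariance-over-variance ratio reproduces the slope exactly when the model has a single predictor. With two predictors, `lm()` estimates the coefficients jointly, which is equivalent to solving the normal equations. The sketch below checks both statements on simulated data. All values are made up, and the variable names are chosen only to echo our dataset.

```r
# Simulated data; precipitation is deliberately correlated with temperature
set.seed(7)
n <- 100
temp <- runif(n, 0, 35)
precip <- 1000 + 20 * temp + rnorm(n, sd = 300)
yield <- 1.7 + 0.02 * temp + 0.0001 * precip + rnorm(n, sd = 0.5)

# Simple regression: the slope equals Cov(X, Y) / Var(X)
slope_cov <- cov(temp, yield) / var(temp)
slope_lm <- coef(lm(yield ~ temp))[["temp"]]

# Multiple regression: solve the normal equations (X'X) beta = X'y
X <- cbind(Intercept = 1, temp = temp, precip = precip)
beta_hat <- solve(crossprod(X), crossprod(X, yield))
beta_lm <- coef(lm(yield ~ temp + precip))

print(c(cov_ratio = slope_cov, lm_simple = slope_lm))
print(cbind(normal_equations = as.numeric(beta_hat), lm = as.numeric(beta_lm)))
```

In this simulation the temperature coefficient from the two-predictor model will generally differ from the simple slope, because part of the simple slope reflects the correlated precipitation variable.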

Confidence Intervals for Coefficients

To estimate the uncertainty around the regression coefficients, we can also calculate confidence intervals: \[\hat{\beta}_i \pm t_{\alpha/2} \times SE(\hat{\beta}_i)\] Where: \(\hat{\beta}_i\) is the estimated regression coefficient, \(t_{\alpha/2}\) is the critical value from the t-distribution for a given confidence level (typically 95%), \(SE(\hat{\beta}_i)\) is the standard error of the regression coefficient.
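As an illustration of the interval formula above, the sketch below builds 95% confidence intervals from the estimates and standard errors in `summary()` and compares them with R's `confint()`. The data are simulated (not our dataset), and the critical value uses the model's residual degrees of freedom.

```r
# Simulated data with the same model structure as ours (values are random)
set.seed(11)
n <- 120
temp <- runif(n, 0, 35)
precip <- runif(n, 200, 3000)
yield <- 1.7 + 0.02 * temp + 0.0001 * precip + rnorm(n, sd = 1)
fit <- lm(yield ~ temp + precip)

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# Critical t value for a 95% interval, using the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# Estimate plus or minus critical value times standard error
manual_ci <- cbind(
  lower = coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"]
)

print(manual_ci)
print(confint(fit, level = 0.95))
```

An interval that contains zero corresponds to a coefficient that is not statistically significant at the 5% level.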

Visualizing the Relationship Between Temperature and Crop Yield with a Regression Line

ggplot(corn_data, aes(x = Average_Temperature_C, y = Crop_Yield_MT_per_HA)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(title = "Linearity Check: Crop Yield vs. Temperature", x = "Average Temperature (°C)", y = "Crop Yield (MT per Hectare)") +
  theme_minimal()

The scatter plot shows a weak positive relationship between temperature and crop yield. The points lie far from the regression line, which indicates that temperature alone is not a strong predictor of yield. We believe that other factors, such as rainfall, soil type, or farming methods, also play a big role in determining crop output, which leads to the variability seen in the data.

Multiple linear regression: Modeling the effect of temperature and precipitation on crop yield

reg_model <- lm(Crop_Yield_MT_per_HA ~ Average_Temperature_C + Total_Precipitation_mm, data = corn_data)
summary(reg_model)
## 
## Call:
## lm(formula = Crop_Yield_MT_per_HA ~ Average_Temperature_C + Total_Precipitation_mm, 
##     data = corn_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69124 -0.73614 -0.09887  0.64441  2.39503 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.718e+00  1.821e-01   9.436  < 2e-16 ***
## Average_Temperature_C  1.892e-02  6.663e-03   2.840  0.00502 ** 
## Total_Precipitation_mm 8.218e-05  8.673e-05   0.947  0.34462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9797 on 185 degrees of freedom
## Multiple R-squared:  0.04892,    Adjusted R-squared:  0.03863 
## F-statistic: 4.757 on 2 and 185 DF,  p-value: 0.009666

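The regression model explains about 5% of the variation in crop yield (R-squared = 0.049) and is statistically significant overall (F = 4.757 on 2 and 185 degrees of freedom, p = 0.0097). Temperature has a small positive and statistically significant coefficient (0.0189 MT per hectare per °C, p = 0.005), while the precipitation coefficient is not significant (p = 0.345).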
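To show where the R-squared figures come from, the sketch below recomputes R-squared and adjusted R-squared from the residuals of a fitted model and compares them with the values stored by `summary()`. The data are simulated with a weak signal (values are made up, not our dataset).

```r
# Simulated data with a weak temperature effect (random values)
set.seed(3)
n <- 188
temp <- runif(n, 0, 35)
precip <- runif(n, 200, 3000)
yield <- 1.7 + 0.02 * temp + rnorm(n, sd = 1)
fit <- lm(yield ~ temp + precip)

# R-squared: share of total variation explained by the model
ss_res <- sum(residuals(fit)^2)             # unexplained variation
ss_tot <- sum((yield - mean(yield))^2)      # total variation
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalises for the number of predictors p
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r_squared = r2, adjusted = adj_r2))
print(c(r_squared = summary(fit)$r.squared, adjusted = summary(fit)$adj.r.squared))
```

Applying the adjustment to our reported output (R-squared = 0.04892, 188 observations, 2 predictors) gives 1 - 0.95108 × 187 / 185, or approximately 0.0386, in line with the adjusted value in the summary.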

Visualizing Diagnostic Plots to Test Model Accuracy

The Residuals vs Fitted plot checks if the model’s errors are scattered randomly and evenly. The Q-Q plot checks if these errors follow a normal pattern.

par(mfrow = c(1, 2))  # Set layout for two plots
# Plotting Residuals vs Fitted (which = 1)
plot(reg_model, which = 1)
# Plotting Q-Q Plot (which = 2)
plot(reg_model, which = 2)

par(mfrow = c(1, 1))

The Main Takeaways and Discussion of Limitations and Strengths

Final Thoughts

The takeaway from our work is that temperature changes are already impacting crop yields, and we think that, without proactive measures, these effects will likely become more severe in the future.

References

citation("dplyr"); citation("stringr"); citation("ggplot2"); citation("corrplot")
## To cite package 'dplyr' in publications use:
## 
##   Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
##   Grammar of Data Manipulation_. R package version 1.1.4,
##   <https://CRAN.R-project.org/package=dplyr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dplyr: A Grammar of Data Manipulation},
##     author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
##     year = {2023},
##     note = {R package version 1.1.4},
##     url = {https://CRAN.R-project.org/package=dplyr},
##   }
## To cite package 'stringr' in publications use:
## 
##   Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common
##   String Operations_. R package version 1.5.1,
##   <https://CRAN.R-project.org/package=stringr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {stringr: Simple, Consistent Wrappers for Common String Operations},
##     author = {Hadley Wickham},
##     year = {2023},
##     note = {R package version 1.5.1},
##     url = {https://CRAN.R-project.org/package=stringr},
##   }
## To cite ggplot2 in publications, please use
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }
## To cite corrplot in publications use:
## 
##   Taiyun Wei and Viliam Simko (2024). R package 'corrplot':
##   Visualization of a Correlation Matrix (Version 0.94). Available from
##   https://github.com/taiyun/corrplot
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{corrplot2024,
##     title = {R package 'corrplot': Visualization of a Correlation Matrix},
##     author = {Taiyun Wei and Viliam Simko},
##     year = {2024},
##     note = {(Version 0.94)},
##     url = {https://github.com/taiyun/corrplot},
##   }