Load neede Package

Course 1: Introduction to R Programming

Objectives

  • Install and set up R and RStudio.
  • Understand RStudio interface and basic R syntax.
  • Understanding Different data types
  • Work with data frames and basic data types.
  • Read and write data (CSV, Excel).
  • Basic data wrangling with dplyr and tidyr.
  • Short example using Agroforest data.

Before starting working you should know where we are working which is called current working directory

# To check current working directory
getwd()
## [1] "D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R"

We have seen thet we are working in documents [1] “D:/Desktop December 2023/OneDrive/Documents”

Now let us change our working directory because our plot and data is store in Data analysis with R folder. To see it we should do a right click on data and click property.

# Setting Our working directory
setwd("D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R")

We may check it again by using getwd() function

getwd()
## [1] "D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R"

1. Install R and RStudio

2. RStudio layout quick tour

  • Source Pane (script editor)

  • Console pane,

- Environment/History Pane,

  • Files/Plots/Packages/Help/Viewer Pane.

  • Use Ctrl+Enter to run a selected line.

1. Comments

Comments are written using the # symbol.

# This is a comment
# Comments are ignored by R when running the code

3. Basic R syntax

Concatenate Elements

Use paste() or paste0() to join text.

paste("ICRAF", "Rwanda", "Data Analyst")
## [1] "ICRAF Rwanda Data Analyst"
paste0("Rainfall_", 2025)
## [1] "Rainfall_2025"

Second Example

# arithmetic
2 + 2
## [1] 4
# variables
x <- 10
y <- seq(1,5)
x * y
## [1] 10 20 30 40 50
# functions
mean(y); sd(y)
## [1] 3
## [1] 1.581139

4. Data structures

# Numeric
x <- 42   
# character
name <- "ICRAF"  
# vectors
v <- c(1,2,3)
# factors
f <- factor(c('tree','crop','tree'))
# logical
is_raining <- TRUE 
# complex
complex_num <- 2 + 3i 
# data.frame
df <- data.frame(id = 1:4, species = c('Grevillea','Ficus','Eucalyptus','Coffee'), dbh = c(5.2, 10.1, 7.4, 2.3))
str(df)
## 'data.frame':    4 obs. of  3 variables:
##  $ id     : int  1 2 3 4
##  $ species: chr  "Grevillea" "Ficus" "Eucalyptus" "Coffee"
##  $ dbh    : num  5.2 10.1 7.4 2.3

Math Operations

rainfall_mm <- c(80, 120, 100, 90)
mean(rainfall_mm) + 10
## [1] 107.5

a) Strings

site <- "Bugesera"
paste("Site:", site)
## [1] "Site: Bugesera"

b) Booleans (Logical Values)

soil_ph <- 6.2
is_neutral <- soil_ph == 7
is_neutral
## [1] FALSE

Operators

a <- 10; b <- 5
a + b
## [1] 15
a > b
## [1] TRUE
(a > 5) & (b < 10)
## [1] TRUE

Nested If

rain <- 110
if (rain > 100) {
  print("High rainfall")
} else if (rain > 80) {
  print("Moderate rainfall")
} else {
  print("Low rainfall")
}
## [1] "High rainfall"

AND / OR Operators

soil_moisture <- 25
if (rain > 90 & soil_moisture > 20) print("Good for planting")
## [1] "Good for planting"

For Loop

for (i in 1:5) {
  print(paste("Tree sample", i))
}
## [1] "Tree sample 1"
## [1] "Tree sample 2"
## [1] "Tree sample 3"
## [1] "Tree sample 4"
## [1] "Tree sample 5"
rainfall <- c(Bugesera=110, Nyagatare=150, Musanze=200)

for (site in names(rainfall)) {
  if (rainfall[site] > 130) {
    print(paste(site, "has high rainfall"))
  } else {
    print(paste(site, "has moderate rainfall"))
  }
}
## [1] "Bugesera has moderate rainfall"
## [1] "Nyagatare has high rainfall"
## [1] "Musanze has high rainfall"

Functions

In R a function is like a little machine in R that you give some input (called arguments) and it gives you some output after doing a specific job

Example 1

add_numbers <- function(x, y) {
  result <- x + y
  return(result)
}

add_numbers(3, 5)  # Output will be 8
## [1] 8

Example 2

calc_biomass <- function(diameter, height) {
  biomass <- 0.05 * diameter^2 * height
  return(biomass)
}
calc_biomass(10, 5)
## [1] 25

c) Lists

site_info <- list(name="Bugesera", rainfall=120, ph=6.3)
site_info
## $name
## [1] "Bugesera"
## 
## $rainfall
## [1] 120
## 
## $ph
## [1] 6.3

d) Matrices

matrix_data <- matrix(1:9, nrow=3, ncol=3)
matrix_data
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

e) Arrays

array_data <- array(1:12, dim=c(3,2,2))
array_data
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11
## [3,]    9   12

f) Data Frames

icraf_data <- data.frame(
  Site=c("Bugesera","Nyagatare","Musanze"),
  Rainfall=c(120, 150, 200),
  PH=c(6.3, 5.9, 6.7)
)
icraf_data
##        Site Rainfall  PH
## 1  Bugesera      120 6.3
## 2 Nyagatare      150 5.9
## 3   Musanze      200 6.7
str(icraf_data)
## 'data.frame':    3 obs. of  3 variables:
##  $ Site    : chr  "Bugesera" "Nyagatare" "Musanze"
##  $ Rainfall: num  120 150 200
##  $ PH      : num  6.3 5.9 6.7

The column called site is character

5. Read and write data

a) Read CSV Data

library(readr)
d <- read_csv('bugesera_sample.csv')
head(d)
## # A tibble: 5 × 9
##      id date  village species seedlings height_cm rainfall_mm longitude latitude
##   <dbl> <chr> <chr>   <chr>       <dbl>     <dbl>       <dbl>     <dbl>    <dbl>
## 1     1 9/12… Bugese… Grevil…        12        35         842      30.4    -2.17
## 2     2 9/13… Bugese… Ficus           8       120         790      30.4    -2.18
## 3     3 9/14… Nyagat… Eucaly…         5        45         600      30.5    -1.95
## 4     4 11/2… Nyagat… Coffee         20        15         610      30.5    -1.96
## 5     5 11/5… Bugese… Grevil…        NA        33         800      30.4    -2.16
write_csv(d, 'bugesera_copy.csv')

6. Simple example: summary of seedling counts (uses sample file)

library(dplyr)
d <- read_csv('bugesera_sample.csv')
d %>% group_by(village) %>% summarise(mean_seedlings = mean(seedlings), n = n())
## # A tibble: 2 × 3
##   village   mean_seedlings     n
##   <chr>              <dbl> <int>
## 1 Bugesera            NA       3
## 2 Nyagatare           12.5     2

==========================================================================

Course 2: Data Visualizations

Principles of good visualization

  • Clear title and axis labels
  • Use appropriate plot type for variable types
  • Add informative legends
  • Consider accessibility (colorblind-friendly palettes)

1. Basic ggplot examples

Example 1: Line Plot

plot(icraf_data$Rainfall, type="l", col="blue", main="Rainfall by Site",
     xlab="Site Index", ylab="Rainfall (mm)")

Example 2: Scatter Plot

a) using Icraf data

plot(icraf_data$Rainfall, icraf_data$PH, main="Rainfall vs PH",
     xlab="Rainfall", ylab="Soil PH", pch=19, col="green")

a) using Bugesera Sample data

df <- read_csv('bugesera_sample.csv')
ggplot(df, aes(x = rainfall_mm, y = seedlings)) + geom_point() + geom_smooth(method = 'lm') + labs(x = 'Rainfall (mm)', y = 'Seedling count', title = 'Seedlings vs Rainfall')

Example 3: Bar Charts Using Icraf Data

barplot(icraf_data$Rainfall, names.arg=icraf_data$Site, col="orange",
        main="Average Rainfall per Site", ylab="Rainfall (mm)")

Using Bugesera Sample data

ggplot(df %>% count(species), aes(x = reorder(species, n), y = n)) + geom_col() + coord_flip() + labs(x = 'Species', y = 'Count')

2. Maps & spatial plots (basic)

# If you have simple site coordinates
# install.packages('sf')
library(sf)
if('longitude' %in% names(df) & 'latitude' %in% names(df)) {
  pts <- st_as_sf(df, coords = c('longitude','latitude'), crs = 4326)
  plot(pts['seedlings'])
} else {
  print('No coordinates in dataset; add longitude and latitude columns for mapping.')
}

3. Multi-panel plots and saving

ggplot(df, aes(x = rainfall_mm, y = seedlings)) + geom_point() + facet_wrap(~village) + labs(title='Seedlings vs Rainfall by village')

ggsave('seedlings_vs_rainfall.png', width = 8, height = 5)

4. Interactive visualization (optional)

  • Use plotly::ggplotly() to make ggplots interactive.
  • leaflet for interactive maps.

========================================================================

Course 3: Danata analysis

Goals

  • Data import and cleaning
  • Exploratory data analysis (EDA)
  • Simple modeling (GLM)
  • Reproducible reporting with R Markdown

Before starting data analysis in R programming let us introduce

1. Introduction to Statistics

a) Descriptive Statistics in R #### Dataset Example

rainfall_data <- c(100, 120, 150, 130, 110)
soil_ph_data <- c(6.5, 6.2, 5.9, 6.1, 6.4)

i. Maximum and Minimum The Min and Max function are used to return the minimum & maximum value in a vector or the data frame respectively.

Example

max(rainfall_data)
## [1] 150
min(rainfall_data)
## [1] 100

ii. Mean The mean, or average, of a dataset is calculated by adding all the values in the dataset and then dividing by the number of values in the set.

Example 1

mean(rainfall_data)
## [1] 122

Example 1

mean(icraf_data$Rainfall)
## [1] 156.6667
# You may also do it in this way, if you want to reuse the mean in next calculation

mean_icraf <- mean(icraf_data$Rainfall)
print(mean_icraf)
## [1] 156.6667

3. Median The median of a dataset is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, the middle two values are the median.

Median of a dataset with even number of values, n= Even Let us say that, we have a dataset with the following ten numbers:

24, 16, 30, 10, 12, 28, 38, 2, 4, 36

We can order this dataset from smallest to largest:

2, 4, 10, 12, 16, 24, 28, 30, 36, 38

The medians of this dataset are 16 and 24, because they are the fifth- and sixth-positioned observations in this dataset. In other words, there are four observations to the left of 16, and four observations to the right of 24.

Median of a dataset with odd number of values, n= Odd

If we added another value (say, 28) near the middle of this dataset:

2, 4, 10, 12, 16, 24, 28, 28, 30, 36, 38

The new median is equal to 24, because there are 5 values smaller than it, and 5 values larger than it.

Example 1:

median(rainfall_data)
## [1] 120

Example 2:: Even:

b <- c(3,4,5,6,12)
median(b)
## [1] 5

The code above outputs 5 as the median, because it is the middle number in the array.

Example 3: Odd:

a <- c(3,4,5,12)
median(a)
## [1] 4.5

The code above outputs 4.5, because it takes the average of the two medians, 4 and 5.

4. Mode

The mode is the value that appears most frequently in a dataset. R does not include a built-in mode function for statistical mode, but you can define one easily.

get_mode <- function(x) {
  uniq <- unique(x)
  uniq[which.max(tabulate(match(x, uniq)))]
}
get_mode(rainfall_data)
## [1] 100

EXample 2

get_mode(icraf_data$Rainfall)
## [1] 120

5. Variance in R

Variance measures how far each number in the set is from the mean. It is the average of the squared differences from the Mean. We can calculate the variance by using var() function in R.

var(icraf_data$Rainfall)
## [1] 1633.333

6. Standard Deviation

Standard Deviation is the square root of variance. It is a measure of the extent to which data varies from the mean. One can calculate the standard deviation by using sd() function in R.

sd(icraf_data$Rainfall)
## [1] 40.41452

2. Importing common data file types

# CSV
df_csv <- read_csv('bugesera_sample.csv')

# If you have Excel files:
# install.packages('readxl')
library(readxl)
# df_xlsx <- read_excel('icraf_sample.xlsx', sheet = 1)

3. Data cleaning patterns

In this data cleaning we should first check the summaryy statistics using glimpse() or str() function.

glimpse() is a function that comes from the dplyr package — part of the tidyverse collection of R packages.

🔍 What it does (in simple words):

glimpse() gives you a quick, easy-to-read summary of your dataset.

It’s like turning your data sideways to see:

  • How many rows and columns you have

  • The names of the columns

  • The types of data (e.g., number, character, factor)

Some Useful Functions

df <- df_csv
# Inspect
glimpse(df)
## Rows: 5
## Columns: 9
## $ id          <dbl> 1, 2, 3, 4, 5
## $ date        <chr> "9/12/2018", "9/13/2018", "9/14/2018", "11/2/2018", "11/5/…
## $ village     <chr> "Bugesera", "Bugesera", "Nyagatare", "Nyagatare", "Bugeser…
## $ species     <chr> "Grevillea", "Ficus", "Eucalyptus", "Coffee", "Grevillea"
## $ seedlings   <dbl> 12, 8, 5, 20, NA
## $ height_cm   <dbl> 35, 120, 45, 15, 33
## $ rainfall_mm <dbl> 842, 790, 600, 610, 800
## $ longitude   <dbl> 30.369, 30.371, 30.522, 30.530, 30.375
## $ latitude    <dbl> -2.17, -2.18, -1.95, -1.96, -2.16
# Handle missing values (example)
df <- df %>% filter(!is.na(seedlings))

3. EDA examples

# Summary
summary(df)
##        id           date             village            species         
##  Min.   :1.00   Length:4           Length:4           Length:4          
##  1st Qu.:1.75   Class :character   Class :character   Class :character  
##  Median :2.50   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.50                                                           
##  3rd Qu.:3.25                                                           
##  Max.   :4.00                                                           
##    seedlings       height_cm       rainfall_mm      longitude    
##  Min.   : 5.00   Min.   : 15.00   Min.   :600.0   Min.   :30.37  
##  1st Qu.: 7.25   1st Qu.: 30.00   1st Qu.:607.5   1st Qu.:30.37  
##  Median :10.00   Median : 40.00   Median :700.0   Median :30.45  
##  Mean   :11.25   Mean   : 53.75   Mean   :710.5   Mean   :30.45  
##  3rd Qu.:14.00   3rd Qu.: 63.75   3rd Qu.:803.0   3rd Qu.:30.52  
##  Max.   :20.00   Max.   :120.00   Max.   :842.0   Max.   :30.53  
##     latitude     
##  Min.   :-2.180  
##  1st Qu.:-2.172  
##  Median :-2.065  
##  Mean   :-2.065  
##  3rd Qu.:-1.958  
##  Max.   :-1.950
# Group summaries
df %>% group_by(village) %>% summarise(total_seedlings = sum(seedlings), avg_height = mean(height_cm))
## # A tibble: 2 × 3
##   village   total_seedlings avg_height
##   <chr>               <dbl>      <dbl>
## 1 Bugesera               20       77.5
## 2 Nyagatare              25       30
# Cross-tabulation
table(df$species, df$village)
##             
##              Bugesera Nyagatare
##   Coffee            0         1
##   Eucalyptus        0         1
##   Ficus             1         0
##   Grevillea         1         0

4. Simple statistical model

# Example: model seedlings count ~ species + mean annual rainfall (synthetic)
mod <- glm(seedlings ~ species + rainfall_mm, data = df, family = poisson())
summary(mod)
## 
## Call:
## glm(formula = seedlings ~ species + rainfall_mm, family = poisson(), 
##     data = df)
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.9957     0.2236  13.397  < 2e-16 ***
## speciesEucalyptus  -1.3863     0.5000  -2.773  0.00556 ** 
## speciesFicus       -0.9163     0.4183  -2.190  0.02850 *  
## speciesGrevillea   -0.5108     0.3651  -1.399  0.16183    
## rainfall_mm             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1.0999e+01  on 3  degrees of freedom
## Residual deviance: 1.5543e-15  on 0  degrees of freedom
## AIC: 24.597
## 
## Number of Fisher Scoring iterations: 3

5. Export cleaned data and results

write_csv(df, 'bugesera_cleaned.csv')
broom::tidy(mod) %>% write_csv('model_coefficients.csv')

6. Tips for working with CIFOR-ICRAF datasets

  • Keep a data dictionary file (.csv or .xlsx) documenting variables and units.
  • Version your cleaned datasets with date stamps (e.g., data_cleaned_2025-10-06.csv).

# Exercices:

1️⃣ Load Data and Packages

# Read the dataset
data <- read.csv("D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R/EasternCrop_sample.csv")

2️⃣ Understand the Structure

str(data)
## 'data.frame':    200 obs. of  7 variables:
##  $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ rainfall   : num  825 1170 1039 959 694 ...
##  $ yield      : num  1891 943 1075 2328 1831 ...
##  $ soil_ph    : num  4.81 7.21 6.02 6.98 5.46 ...
##  $ temperature: num  20 21.3 20.1 19.1 19.4 ...
##  $ crop_type  : chr  "Tomato" "Cassava" "Beans" "Maize" ...
##  $ region     : chr  "Kayonza" "Bugesera" "Kayonza" "Bugesera" ...
summary(data)
##        id            rainfall          yield           soil_ph     
##  Min.   :  1.00   Min.   : 603.3   Min.   : 808.6   Min.   :4.533  
##  1st Qu.: 50.75   1st Qu.: 737.1   1st Qu.:1244.5   1st Qu.:5.267  
##  Median :100.50   Median : 896.7   Median :1720.8   Median :6.076  
##  Mean   :100.50   Mean   : 890.4   Mean   :1657.4   Mean   :6.062  
##  3rd Qu.:150.25   3rd Qu.:1054.1   3rd Qu.:2061.7   3rd Qu.:6.933  
##  Max.   :200.00   Max.   :1192.1   Max.   :2483.9   Max.   :7.499  
##   temperature     crop_type            region         
##  Min.   :18.22   Length:200         Length:200        
##  1st Qu.:20.83   Class :character   Class :character  
##  Median :23.61   Mode  :character   Mode  :character  
##  Mean   :23.75                                        
##  3rd Qu.:26.69                                        
##  Max.   :29.88

3️⃣ Check for Missing Values

colSums(is.na(data))
##          id    rainfall       yield     soil_ph temperature   crop_type 
##           0           0           0           0           0           0 
##      region 
##           0

4️⃣ Add Missing Values and Handle Them

We’ll randomly introduce missing values in numeric columns and handle them using mean imputation.

set.seed(123)
# Randomly replace 5% of numeric values with NA
for (col in names(data)) {
  if (is.numeric(data[[col]])) {
    na_index <- sample(1:nrow(data), size = 0.05 * nrow(data))
    data[na_index, col] <- NA
  }
}
# Check missingness
colSums(is.na(data))
##          id    rainfall       yield     soil_ph temperature   crop_type 
##          10          10          10          10          10           0 
##      region 
##           0
# Mean imputation for numeric columns
data <- data %>% mutate(across(where(is.numeric),
                               ~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

5️⃣ Add New Columns

We’ll create two new variables:
- irrigated → “Yes” or “No”
- maturity_status → “Mature” or “Non-Mature”

set.seed(42)
data <- data %>% 
  mutate(
    irrigated = sample(c("Yes", "No"), nrow(data), replace = TRUE),
    maturity_status = sample(c("Mature", "Non-Mature"), nrow(data), replace = TRUE)
  )

head(data)
##   id  rainfall     yield  soil_ph temperature crop_type   region irrigated
## 1  1  824.7241 1891.4538 4.809372    20.02722    Tomato  Kayonza       Yes
## 2  2 1170.4286  943.0379 7.207659    21.34308   Cassava Bugesera       Yes
## 3  3 1039.1964 1074.7688 6.015757    20.12413     Beans  Kayonza       Yes
## 4  4  959.1951 2327.5421 6.979372    19.06443     Maize Bugesera       Yes
## 5  5  693.6112 1830.9294 5.460149    19.44763    Tomato  Kayonza        No
## 6  6  693.5967  815.6350 7.186570    23.52935   Cassava Bugesera        No
##   maturity_status
## 1      Non-Mature
## 2      Non-Mature
## 3          Mature
## 4      Non-Mature
## 5          Mature
## 6          Mature

6️⃣ Univariate Analysis

Explore individual columns using statistics and visualizations.

# Summary statistics for numeric columns
summarytools::descr(data, stats = c("mean", "sd", "min", "max"))
## Descriptive Statistics  
## data  
## N: 200  
## 
##                     id   rainfall   soil_ph   temperature     yield
## ------------- -------- ---------- --------- ------------- ---------
##          Mean    98.84     892.82      6.03         23.71   1655.86
##       Std.Dev    55.41     172.39      0.89          3.27    484.90
##           Min     1.00     603.31      4.53         18.22    808.60
##           Max   200.00    1192.13      7.50         29.88   2483.86
# Histogram example
ggplot(data, aes(x = yield)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Yield", x = "Yield", y = "Count")

7️⃣ Bivariate Analysis

Explore relationships between numeric variables and categorical features.

# Scatter plot: rainfall vs yield
ggplot(data, aes(x = rainfall, y = yield, color = irrigated)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Rainfall vs Yield by Irrigation", x = "Rainfall (mm)", y = "Yield (kg)")

Correlation

# Correlation plot for numeric variables
numeric_data <- data %>% dplyr::select(where(is.numeric))
corrplot(cor(numeric_data), method = "circle")

Second Method

numeric_data <- dplyr::select(data, where(is.numeric))
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
corrplot::corrplot(correlation_matrix, method = "circle")

8️⃣ Detect Outliers

Use boxplots and IQR method to identify outliers.

# Boxplot example
ggplot(data, aes(y = yield)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot of Yield") +
  theme_minimal()

# Identify outliers using IQR method
Q1 <- quantile(data$yield, 0.25)
Q3 <- quantile(data$yield, 0.75)
IQR <- Q3 - Q1
outliers <- data %>% filter(yield < (Q1 - 1.5*IQR) | yield > (Q3 + 1.5*IQR))
nrow(outliers)
## [1] 0

9️⃣ Feature Engineering

Create new features that provide additional insights.

data <- data %>% mutate(
  yield_per_mm = yield / rainfall,
  yield_category = ifelse(yield > mean(yield), "High", "Low")
)

head(data[, c("yield", "rainfall", "yield_per_mm", "yield_category")])
##       yield  rainfall yield_per_mm yield_category
## 1 1891.4538  824.7241    2.2934383           High
## 2  943.0379 1170.4286    0.8057202            Low
## 3 1074.7688 1039.1964    1.0342307            Low
## 4 2327.5421  959.1951    2.4265576           High
## 5 1830.9294  693.6112    2.6397057           High
## 6  815.6350  693.5967    1.1759499            Low

🔟 Conclusions and Insights

✅ Added irrigation and maturity status columns
✅ Introduced and imputed missing values using mean imputation
✅ Conducted univariate and bivariate analysis with visualizations
✅ Identified outliers and engineered new features for better insights

```

======================================================================