dplyr and
tidyr.Before starting working you should know where we are working which is called current working directory
# To check current working directory
getwd()
## [1] "D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R"
We have seen thet we are working in documents [1] “D:/Desktop December 2023/OneDrive/Documents”
Now let us change our working directory because our plot and data is store in Data analysis with R folder. To see it we should do a right click on data and click property.
# Setting Our working directory
setwd("D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R")
We may check it again by using getwd() function
getwd()
## [1] "D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R"
- Environment/History
Pane,
Files/Plots/Packages/Help/Viewer Pane.
Use Ctrl+Enter to run a selected line.
Example 1: Line Plot
plot(icraf_data$Rainfall, type="l", col="blue", main="Rainfall by Site",
xlab="Site Index", ylab="Rainfall (mm)")
Example 2: Scatter Plot
a) using Icraf data
plot(icraf_data$Rainfall, icraf_data$PH, main="Rainfall vs PH",
xlab="Rainfall", ylab="Soil PH", pch=19, col="green")
a) using Bugesera Sample data
df <- read_csv('bugesera_sample.csv')
ggplot(df, aes(x = rainfall_mm, y = seedlings)) + geom_point() + geom_smooth(method = 'lm') + labs(x = 'Rainfall (mm)', y = 'Seedling count', title = 'Seedlings vs Rainfall')
Example 3: Bar Charts Using Icraf Data
barplot(icraf_data$Rainfall, names.arg=icraf_data$Site, col="orange",
main="Average Rainfall per Site", ylab="Rainfall (mm)")
Using Bugesera Sample data
ggplot(df %>% count(species), aes(x = reorder(species, n), y = n)) + geom_col() + coord_flip() + labs(x = 'Species', y = 'Count')
# If you have simple site coordinates
# install.packages('sf')
library(sf)
if('longitude' %in% names(df) & 'latitude' %in% names(df)) {
pts <- st_as_sf(df, coords = c('longitude','latitude'), crs = 4326)
plot(pts['seedlings'])
} else {
print('No coordinates in dataset; add longitude and latitude columns for mapping.')
}
ggplot(df, aes(x = rainfall_mm, y = seedlings)) + geom_point() + facet_wrap(~village) + labs(title='Seedlings vs Rainfall by village')
ggsave('seedlings_vs_rainfall.png', width = 8, height = 5)
plotly::ggplotly() to make ggplots
interactive.leaflet for interactive maps.========================================================================
Before starting data analysis in R programming let us introduce
a) Descriptive Statistics in R #### Dataset Example
rainfall_data <- c(100, 120, 150, 130, 110)
soil_ph_data <- c(6.5, 6.2, 5.9, 6.1, 6.4)
i. Maximum and Minimum The Min and Max function are used to return the minimum & maximum value in a vector or the data frame respectively.
Example
max(rainfall_data)
## [1] 150
min(rainfall_data)
## [1] 100
ii. Mean The mean, or average, of a dataset is calculated by adding all the values in the dataset and then dividing by the number of values in the set.
Example 1
mean(rainfall_data)
## [1] 122
Example 1
mean(icraf_data$Rainfall)
## [1] 156.6667
# You may also do it in this way, if you want to reuse the mean in next calculation
mean_icraf <- mean(icraf_data$Rainfall)
print(mean_icraf)
## [1] 156.6667
3. Median The median of a dataset is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, the middle two values are the median.
Median of a dataset with even number of values, n= Even Let us say that, we have a dataset with the following ten numbers:
24, 16, 30, 10, 12, 28, 38, 2, 4, 36
We can order this dataset from smallest to largest:
2, 4, 10, 12, 16, 24, 28, 30, 36, 38
The medians of this dataset are 16 and 24, because they are the fifth- and sixth-positioned observations in this dataset. In other words, there are four observations to the left of 16, and four observations to the right of 24.
Median of a dataset with odd number of values, n= Odd
If we added another value (say, 28) near the middle of this dataset:
2, 4, 10, 12, 16, 24, 28, 28, 30, 36, 38
The new median is equal to 24, because there are 5 values smaller than it, and 5 values larger than it.
Example 1:
median(rainfall_data)
## [1] 120
Example 2:: Even:
b <- c(3,4,5,6,12)
median(b)
## [1] 5
The code above outputs 5 as the median, because it is the middle number in the array.
Example 3: Odd:
a <- c(3,4,5,12)
median(a)
## [1] 4.5
The code above outputs 4.5, because it takes the average of the two medians, 4 and 5.
4. Mode
The mode is the value that appears most frequently in a dataset. R does not include a built-in mode function for statistical mode, but you can define one easily.
get_mode <- function(x) {
uniq <- unique(x)
uniq[which.max(tabulate(match(x, uniq)))]
}
get_mode(rainfall_data)
## [1] 100
EXample 2
get_mode(icraf_data$Rainfall)
## [1] 120
5. Variance in R
Variance measures how far each number in the set is from the mean. It is the average of the squared differences from the Mean. We can calculate the variance by using var() function in R.
var(icraf_data$Rainfall)
## [1] 1633.333
6. Standard Deviation
Standard Deviation is the square root of variance. It is a measure of the extent to which data varies from the mean. One can calculate the standard deviation by using sd() function in R.
sd(icraf_data$Rainfall)
## [1] 40.41452
# CSV
df_csv <- read_csv('bugesera_sample.csv')
# If you have Excel files:
# install.packages('readxl')
library(readxl)
# df_xlsx <- read_excel('icraf_sample.xlsx', sheet = 1)
In this data cleaning we should first check the summaryy statistics using glimpse() or str() function.
glimpse() is a function that comes from the dplyr package — part of the tidyverse collection of R packages.
🔍 What it does (in simple words):
glimpse() gives you a quick, easy-to-read summary of your dataset.
It’s like turning your data sideways to see:
How many rows and columns you have
The names of the columns
The types of data (e.g., number, character, factor)
Some Useful Functions
df <- df_csv
# Inspect
glimpse(df)
## Rows: 5
## Columns: 9
## $ id <dbl> 1, 2, 3, 4, 5
## $ date <chr> "9/12/2018", "9/13/2018", "9/14/2018", "11/2/2018", "11/5/…
## $ village <chr> "Bugesera", "Bugesera", "Nyagatare", "Nyagatare", "Bugeser…
## $ species <chr> "Grevillea", "Ficus", "Eucalyptus", "Coffee", "Grevillea"
## $ seedlings <dbl> 12, 8, 5, 20, NA
## $ height_cm <dbl> 35, 120, 45, 15, 33
## $ rainfall_mm <dbl> 842, 790, 600, 610, 800
## $ longitude <dbl> 30.369, 30.371, 30.522, 30.530, 30.375
## $ latitude <dbl> -2.17, -2.18, -1.95, -1.96, -2.16
# Handle missing values (example)
df <- df %>% filter(!is.na(seedlings))
# Summary
summary(df)
## id date village species
## Min. :1.00 Length:4 Length:4 Length:4
## 1st Qu.:1.75 Class :character Class :character Class :character
## Median :2.50 Mode :character Mode :character Mode :character
## Mean :2.50
## 3rd Qu.:3.25
## Max. :4.00
## seedlings height_cm rainfall_mm longitude
## Min. : 5.00 Min. : 15.00 Min. :600.0 Min. :30.37
## 1st Qu.: 7.25 1st Qu.: 30.00 1st Qu.:607.5 1st Qu.:30.37
## Median :10.00 Median : 40.00 Median :700.0 Median :30.45
## Mean :11.25 Mean : 53.75 Mean :710.5 Mean :30.45
## 3rd Qu.:14.00 3rd Qu.: 63.75 3rd Qu.:803.0 3rd Qu.:30.52
## Max. :20.00 Max. :120.00 Max. :842.0 Max. :30.53
## latitude
## Min. :-2.180
## 1st Qu.:-2.172
## Median :-2.065
## Mean :-2.065
## 3rd Qu.:-1.958
## Max. :-1.950
# Group summaries
df %>% group_by(village) %>% summarise(total_seedlings = sum(seedlings), avg_height = mean(height_cm))
## # A tibble: 2 × 3
## village total_seedlings avg_height
## <chr> <dbl> <dbl>
## 1 Bugesera 20 77.5
## 2 Nyagatare 25 30
# Cross-tabulation
table(df$species, df$village)
##
## Bugesera Nyagatare
## Coffee 0 1
## Eucalyptus 0 1
## Ficus 1 0
## Grevillea 1 0
# Example: model seedlings count ~ species + mean annual rainfall (synthetic)
mod <- glm(seedlings ~ species + rainfall_mm, data = df, family = poisson())
summary(mod)
##
## Call:
## glm(formula = seedlings ~ species + rainfall_mm, family = poisson(),
## data = df)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.9957 0.2236 13.397 < 2e-16 ***
## speciesEucalyptus -1.3863 0.5000 -2.773 0.00556 **
## speciesFicus -0.9163 0.4183 -2.190 0.02850 *
## speciesGrevillea -0.5108 0.3651 -1.399 0.16183
## rainfall_mm NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1.0999e+01 on 3 degrees of freedom
## Residual deviance: 1.5543e-15 on 0 degrees of freedom
## AIC: 24.597
##
## Number of Fisher Scoring iterations: 3
write_csv(df, 'bugesera_cleaned.csv')
broom::tidy(mod) %>% write_csv('model_coefficients.csv')
data_cleaned_2025-10-06.csv).# Read the dataset
data <- read.csv("D:/Desktop December 2023/OneDrive/Desktop/Data analysis with R/EasternCrop_sample.csv")
str(data)
## 'data.frame': 200 obs. of 7 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ rainfall : num 825 1170 1039 959 694 ...
## $ yield : num 1891 943 1075 2328 1831 ...
## $ soil_ph : num 4.81 7.21 6.02 6.98 5.46 ...
## $ temperature: num 20 21.3 20.1 19.1 19.4 ...
## $ crop_type : chr "Tomato" "Cassava" "Beans" "Maize" ...
## $ region : chr "Kayonza" "Bugesera" "Kayonza" "Bugesera" ...
summary(data)
## id rainfall yield soil_ph
## Min. : 1.00 Min. : 603.3 Min. : 808.6 Min. :4.533
## 1st Qu.: 50.75 1st Qu.: 737.1 1st Qu.:1244.5 1st Qu.:5.267
## Median :100.50 Median : 896.7 Median :1720.8 Median :6.076
## Mean :100.50 Mean : 890.4 Mean :1657.4 Mean :6.062
## 3rd Qu.:150.25 3rd Qu.:1054.1 3rd Qu.:2061.7 3rd Qu.:6.933
## Max. :200.00 Max. :1192.1 Max. :2483.9 Max. :7.499
## temperature crop_type region
## Min. :18.22 Length:200 Length:200
## 1st Qu.:20.83 Class :character Class :character
## Median :23.61 Mode :character Mode :character
## Mean :23.75
## 3rd Qu.:26.69
## Max. :29.88
colSums(is.na(data))
## id rainfall yield soil_ph temperature crop_type
## 0 0 0 0 0 0
## region
## 0
We’ll randomly introduce missing values in numeric columns and handle them using mean imputation.
set.seed(123)
# Randomly replace 5% of numeric values with NA
for (col in names(data)) {
if (is.numeric(data[[col]])) {
na_index <- sample(1:nrow(data), size = 0.05 * nrow(data))
data[na_index, col] <- NA
}
}
# Check missingness
colSums(is.na(data))
## id rainfall yield soil_ph temperature crop_type
## 10 10 10 10 10 0
## region
## 0
# Mean imputation for numeric columns
data <- data %>% mutate(across(where(is.numeric),
~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
We’ll create two new variables:
- irrigated → “Yes” or “No”
- maturity_status → “Mature” or “Non-Mature”
set.seed(42)
data <- data %>%
mutate(
irrigated = sample(c("Yes", "No"), nrow(data), replace = TRUE),
maturity_status = sample(c("Mature", "Non-Mature"), nrow(data), replace = TRUE)
)
head(data)
## id rainfall yield soil_ph temperature crop_type region irrigated
## 1 1 824.7241 1891.4538 4.809372 20.02722 Tomato Kayonza Yes
## 2 2 1170.4286 943.0379 7.207659 21.34308 Cassava Bugesera Yes
## 3 3 1039.1964 1074.7688 6.015757 20.12413 Beans Kayonza Yes
## 4 4 959.1951 2327.5421 6.979372 19.06443 Maize Bugesera Yes
## 5 5 693.6112 1830.9294 5.460149 19.44763 Tomato Kayonza No
## 6 6 693.5967 815.6350 7.186570 23.52935 Cassava Bugesera No
## maturity_status
## 1 Non-Mature
## 2 Non-Mature
## 3 Mature
## 4 Non-Mature
## 5 Mature
## 6 Mature
Explore individual columns using statistics and visualizations.
# Summary statistics for numeric columns
summarytools::descr(data, stats = c("mean", "sd", "min", "max"))
## Descriptive Statistics
## data
## N: 200
##
## id rainfall soil_ph temperature yield
## ------------- -------- ---------- --------- ------------- ---------
## Mean 98.84 892.82 6.03 23.71 1655.86
## Std.Dev 55.41 172.39 0.89 3.27 484.90
## Min 1.00 603.31 4.53 18.22 808.60
## Max 200.00 1192.13 7.50 29.88 2483.86
# Histogram example
ggplot(data, aes(x = yield)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Yield", x = "Yield", y = "Count")
Explore relationships between numeric variables and categorical features.
# Scatter plot: rainfall vs yield
ggplot(data, aes(x = rainfall, y = yield, color = irrigated)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(title = "Rainfall vs Yield by Irrigation", x = "Rainfall (mm)", y = "Yield (kg)")
Correlation
# Correlation plot for numeric variables
numeric_data <- data %>% dplyr::select(where(is.numeric))
corrplot(cor(numeric_data), method = "circle")
Second Method
numeric_data <- dplyr::select(data, where(is.numeric))
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
corrplot::corrplot(correlation_matrix, method = "circle")
Use boxplots and IQR method to identify outliers.
# Boxplot example
ggplot(data, aes(y = yield)) +
geom_boxplot(fill = "orange", color = "black") +
labs(title = "Boxplot of Yield") +
theme_minimal()
# Identify outliers using IQR method
Q1 <- quantile(data$yield, 0.25)
Q3 <- quantile(data$yield, 0.75)
IQR <- Q3 - Q1
outliers <- data %>% filter(yield < (Q1 - 1.5*IQR) | yield > (Q3 + 1.5*IQR))
nrow(outliers)
## [1] 0
Create new features that provide additional insights.
data <- data %>% mutate(
yield_per_mm = yield / rainfall,
yield_category = ifelse(yield > mean(yield), "High", "Low")
)
head(data[, c("yield", "rainfall", "yield_per_mm", "yield_category")])
## yield rainfall yield_per_mm yield_category
## 1 1891.4538 824.7241 2.2934383 High
## 2 943.0379 1170.4286 0.8057202 Low
## 3 1074.7688 1039.1964 1.0342307 Low
## 4 2327.5421 959.1951 2.4265576 High
## 5 1830.9294 693.6112 2.6397057 High
## 6 815.6350 693.5967 1.1759499 Low
✅ Added irrigation and maturity
status columns
✅ Introduced and imputed missing values using mean
imputation
✅ Conducted univariate and bivariate analysis with visualizations
✅ Identified outliers and engineered new features for better
insights
```
======================================================================
1. Comments
Comments are written using the
#symbol.3. Basic R syntax
Concatenate Elements
Use
paste()orpaste0()to join text.Second Example
4. Data structures
Math Operations
a) Strings
b) Booleans (Logical Values)
Operators
Nested If
AND / OR Operators
For Loop
Functions
In R a function is like a little machine in R that you give some input (called arguments) and it gives you some output after doing a specific job
Example 1
Example 2
c) Lists
d) Matrices
e) Arrays
f) Data Frames
The column called site is character
5. Read and write data
a) Read CSV Data
6. Simple example: summary of seedling counts (uses sample file)
==========================================================================