About the Data
The data comes from the Kenya National Bureau of Statistics Data Tables.
It covers the distribution of population and households by county,
according to the 2019 Kenya Population and Housing Census.
It contains the following fields:
- County
- Total_Population
- Male_Population
- Female_Population
- Intersex_Population
- Total_Household
- Conventional_Household
- Group_Quarters
- Land_Area_SqKm
- Density_Persons_per_SqKm
Steps Taken
- Importing Necessary R Packages
- Loading Data to R
- Understanding the Data
- Data Cleaning and Preparation - Missing Values
- Exploratory Data Analysis - Visualizations
- Findings
Importing necessary packages
library(readxl) # To read excel data into R
library(tidyverse) # To tidy data
library(plotly) # To create interactive visualizations
library(knitr) # For report generation
library(kableExtra) # To display table
library(dplyr) # For data analysis
Load and Preview the Data
# Load data to a dataFrame
pop_data <- read_excel("C:/Users/KNBS/Documents/PROJECTS/Education Loans R Shiny/Population-households-density-by-county.xlsx", skip = 4)
# Preview the Data
head(pop_data)
Understanding the structure of the Data
str(pop_data)
tibble [48 × 10] (S3: tbl_df/tbl/data.frame)
$ County : chr [1:48] "MOMBASA" "KWALE" "KILIFI" "TANA RIVER" ...
$ Total_Population : num [1:48] 1208333 866820 1453787 315943 143920 ...
$ Male_Population : num [1:48] 610257 425121 704089 158550 76103 ...
$ Female_Population : num [1:48] 598046 441681 749673 157391 67813 ...
$ Intersex_Population : num [1:48] 30 18 25 2 4 7 34 49 37 18 ...
$ Total_Household : num [1:48] 378422 173176 298472 68242 37963 ...
$ Conventional_Household : num [1:48] 376295 172802 297990 66984 34231 ...
$ Group_Quarters : num [1:48] 2127 374 482 1258 3732 ...
$ Land_Area_SqKm : num [1:48] 220 8254 12553 37904 6283 ...
$ Density_Persons_per_SqKm: num [1:48] 5495.01 105.02 115.81 8.34 22.91 ...
OBSERVATIONS:
- There are 10 columns as listed, all with appropriate data types.
- The data has 48 records, one more than Kenya's 47 counties.
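As a quick sanity check on the structure, the sketch below rebuilds the first five rows printed by `str()` and tests whether the component columns add up to their totals. Two things here are assumptions rather than statements from the source: that Total_Population is the sum of the three sex columns, and that Total_Household is the sum of Conventional_Household and Group_Quarters. `sample_rows` is a hypothetical name, and the fifth county is taken to be LAMU because its density (22.91) matches the Coast table later in the report.

```r
# First five rows as printed by str(pop_data); fifth county assumed to be LAMU
sample_rows <- data.frame(
  County                 = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU"),
  Total_Population       = c(1208333, 866820, 1453787, 315943, 143920),
  Male_Population        = c(610257, 425121, 704089, 158550, 76103),
  Female_Population      = c(598046, 441681, 749673, 157391, 67813),
  Intersex_Population    = c(30, 18, 25, 2, 4),
  Total_Household        = c(378422, 173176, 298472, 68242, 37963),
  Conventional_Household = c(376295, 172802, 297990, 66984, 34231),
  Group_Quarters         = c(2127, 374, 482, 1258, 3732)
)

# Assumed identity: total population = male + female + intersex
pop_ok <- with(sample_rows,
  Total_Population == Male_Population + Female_Population + Intersex_Population)

# Assumed identity: total households = conventional + group quarters
hh_ok <- with(sample_rows,
  Total_Household == Conventional_Household + Group_Quarters)

all(pop_ok)
all(hh_ok)
```

Both checks hold for these five rows, which supports reading the total columns as sums of their components.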
Missing Values
Checking for rows with any missing values (NA) and viewing them
# Check for NAs in the entire pop_data data frame
na_rows <- pop_data[apply(is.na(pop_data), 1, any), ]
# View the rows with NAs in any column
print(na_rows)
OBSERVATION: Only one row has missing values. It is the last row, which
is not part of the county data, so we drop it.
pop_data <- na.omit(pop_data)
Checking for missing values again to confirm the row was removed
# Check for NAs in the entire pop_data data frame
na_rows <- pop_data[apply(is.na(pop_data), 1, any), ]
# View the rows with NAs in any column
print(na_rows)
OBSERVATION: There are no missing values left.
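An equivalent way to find incomplete rows is `complete.cases()`, with `colSums(is.na(...))` giving a count per column. The sketch below is only an illustration: it uses two real rows from the `str()` output plus one hypothetical footer row standing in for the unrelated last row of the spreadsheet. `toy` and the footer text are made up.

```r
# Two real rows plus a HYPOTHETICAL footer row with missing values
toy <- data.frame(
  County           = c("MOMBASA", "KWALE", "footer note (hypothetical)"),
  Total_Population = c(1208333, 866820, NA),
  Total_Household  = c(378422, 173176, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows with at least one NA
incomplete <- toy[!complete.cases(toy), ]

# Keep only rows without any NA
toy_clean <- toy[complete.cases(toy), ]

na_per_column
nrow(incomplete)
nrow(toy_clean)
```

Note that `na.omit()` drops every row containing an NA, so inspecting the incomplete rows first, as done above, helps avoid discarding a real county by accident.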
Getting Summary statistics of the data fields
# Summary statistics of the Data's columns
summary(pop_data)
County Total_Population Male_Population Female_Population Intersex_Population Total_Household Conventional_Household Group_Quarters
Length:47 Min. : 143920 Min. : 76103 Min. : 67813 Min. : 2.00 Min. : 37963 Min. : 34231 Min. : 77.0
Class :character 1st Qu.: 609505 1st Qu.: 303110 1st Qu.: 311291 1st Qu.: 18.00 1st Qu.: 141956 1st Qu.: 140409 1st Qu.: 610.5
Mode :character Median : 893681 Median : 450741 Median : 448868 Median : 25.00 Median : 204188 Median : 203576 Median : 1258.0
Mean :1012006 Mean : 501023 Mean : 510951 Mean : 32.43 Mean : 258381 Mean : 256234 Mean : 2146.7
3rd Qu.:1156724 3rd Qu.: 569992 3rd Qu.: 589759 3rd Qu.: 34.00 3rd Qu.: 302844 3rd Qu.: 299550 3rd Qu.: 2681.0
Max. :4397073 Max. :2192452 Max. :2204376 Max. :245.00 Max. :1506888 Max. :1494676 Max. :17809.0
Land_Area_SqKm Density_Persons_per_SqKm
Min. : 219.9 Min. : 6.481
1st Qu.: 2526.2 1st Qu.: 52.826
Median : 3325.0 Median : 220.377
Mean :12359.5 Mean : 509.163
3rd Qu.:14852.6 3rd Qu.: 415.876
Max. :70944.3 Max. :6246.995
Adding an extra column called Region for regional analysis
# Create 'Region' column based on 'County'
# (each county appears in exactly one list; unmatched names become "Other")
pop_data <- pop_data %>%
  mutate(Region = case_when(
    County %in% c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU", "TAITA/TAVETA") ~ "Coast",
    County %in% "NAIROBI CITY" ~ "Nairobi",
    County %in% c("KIAMBU", "MURANG'A", "NYERI", "NYANDARUA", "KIRINYAGA") ~ "Central",
    County %in% c("MANDERA", "WAJIR", "GARISSA", "MARSABIT") ~ "North Eastern",
    County %in% c("TURKANA", "UASIN GISHU", "ELGEYO/MARAKWET", "KERICHO", "WEST POKOT",
                  "SAMBURU", "TRANS NZOIA", "BARINGO", "NANDI", "LAIKIPIA", "NAKURU",
                  "NAROK", "KAJIADO", "BOMET") ~ "Rift Valley",
    County %in% c("KAKAMEGA", "VIHIGA", "BUNGOMA", "BUSIA") ~ "Western",
    County %in% c("KISII", "NYAMIRA", "HOMA BAY", "MIGORI", "KISUMU", "SIAYA") ~ "Nyanza",
    County %in% c("THARAKA-NITHI", "EMBU", "KITUI", "MAKUENI", "MERU", "ISIOLO", "MACHAKOS") ~ "Eastern",
    TRUE ~ "Other"
  ))
# Reorder columns with 'Region' as the first column
pop_data <- pop_data %>%
select(Region, everything())
# Check the resulting data frame
head(pop_data)
# Check unique values in the 'Region' column
unique(pop_data$Region)
[1] "Coast" "North Eastern" "Eastern" "Central" "Rift Valley" "Western" "Nyanza" "Nairobi"
OBSERVATION:
- All records fall into the 8 regions listed above; none is labelled "Other".
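The same check can be made explicit in code, so that a misspelled or unmapped county fails loudly instead of silently ending up as "Other". Here is a minimal sketch using a named lookup vector. `region_lookup` and `counties` are hypothetical names, and only a handful of counties from the mapping above are included to keep it short.

```r
# Partial lookup taken from the mapping above (illustrative subset only)
region_lookup <- c(
  "MOMBASA"      = "Coast",
  "KWALE"        = "Coast",
  "NAIROBI CITY" = "Nairobi",
  "KIAMBU"       = "Central",
  "BOMET"        = "Rift Valley"
)

counties <- c("MOMBASA", "NAIROBI CITY", "BOMET", "KIAMBU")

# Names missing from the lookup come back as NA and are labelled "Other"
region <- unname(region_lookup[counties])
region[is.na(region)] <- "Other"

# Stop early if any county was left unmapped
stopifnot(!any(region == "Other"))

# Number of counties per region
table(region)
```

A lookup vector keeps each county in exactly one place, which makes duplicated or conflicting assignments easier to spot than in a long chain of conditions.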
EXPLORATORY DATA ANALYSIS
VISUALIZATIONS
Total Population by Region
# Bar Plot of Total Population by Region
plot_ly(data = pop_data,
x = ~Region,
y = ~Total_Population,
type = 'bar',
name = 'Total Population',
marker = list(color = 'blue')) %>%
layout(title = 'Total Population by Region',
xaxis = list(title = 'Region'),
yaxis = list(title = 'Total Population'))
OBSERVATIONS:
- The Rift Valley region has the highest population.
- The North Eastern region has the lowest population.
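Because `pop_data` holds one row per county, the bar chart above draws several county values at each region position. Summing per region first makes the regional totals explicit and lets them be checked as numbers. The sketch below uses base R on six rows taken from the output earlier in the report: five Coast counties from `str()` and Nairobi, whose population is the maximum in `summary()`. The Coast total is therefore a partial sum, since TAITA/TAVETA's population is not shown in the source output. `sample_rows` and `region_totals` are hypothetical names.

```r
# Six rows reconstructed from the printed output (Coast is incomplete here)
sample_rows <- data.frame(
  Region = c("Coast", "Coast", "Coast", "Coast", "Coast", "Nairobi"),
  County = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU", "NAIROBI CITY"),
  Total_Population = c(1208333, 866820, 1453787, 315943, 143920, 4397073)
)

# One row per region, with county populations summed
region_totals <- aggregate(Total_Population ~ Region, data = sample_rows, FUN = sum)

# Sort so the most populous region comes first
region_totals <- region_totals[order(-region_totals$Total_Population), ]

region_totals
```

On the full data, a table like `region_totals` could be passed to `plot_ly()` in place of `pop_data`.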
Top 10 Highest County Populations
#Barplot of Top 10 Highest County Populations
# Order the data by Total_Population in descending order
Total_Population_desc <- pop_data[order(pop_data$Total_Population, decreasing = TRUE), ]
# Select the top N counties (e.g., top 10)
top_counties <- head(Total_Population_desc, n = 10)
plot_ly(data = top_counties,
x = ~reorder(County, -Total_Population),
y = ~Total_Population,
type = 'bar',
name = 'Total Population',
marker = list(color = 'blue')) %>%
layout(title = 'Top 10 Highest County Populations',
xaxis = list(title = 'County'),
yaxis = list(title = 'Total Population'))
OBSERVATION:
- Nairobi County has the highest population of all counties, followed by
Kiambu and then Nakuru.
Bottom 10 (Low) County Populations
# Barplot of Bottom 10 County Populations
# Order the data by Total_Population in ascending order
pop_data_sorted <- pop_data[order(pop_data$Total_Population), ]
# Select the bottom N counties (e.g., bottom 10)
bottom_counties <- head(pop_data_sorted, n = 10)
# Create the plot with ordered bars in ascending order
plot_ly(data = bottom_counties,
x = ~reorder(County, Total_Population),
y = ~Total_Population,
type = 'bar',
name = 'Total Population',
marker = list(color = 'blue')) %>%
layout(title = 'Bottom 10 County Populations',
xaxis = list(title = 'County'),
yaxis = list(title = 'Total Population'))
OBSERVATION:
- Lamu County has the lowest population of all counties, followed by
Isiolo and then Samburu.
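Since dplyr is already loaded, the sort-then-`head()` pattern used for both charts could also be written with `slice_max()` and `slice_min()`. This is a sketch on six rows reconstructed from the printed output, not the full data, and it uses `n = 2` only because the sample is small. `sample_rows` is a hypothetical name.

```r
library(dplyr)

# Six rows reconstructed from the str() and summary() output
sample_rows <- data.frame(
  County = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU", "NAIROBI CITY"),
  Total_Population = c(1208333, 866820, 1453787, 315943, 143920, 4397073)
)

# Two most populous counties in the sample
top_counties <- sample_rows %>%
  slice_max(Total_Population, n = 2)

# Two least populous counties in the sample
bottom_counties <- sample_rows %>%
  slice_min(Total_Population, n = 2)

top_counties$County
bottom_counties$County
```

By default both functions keep ties, so they can return more than `n` rows when several counties share the same value.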
Population by Gender
# Pie chart of male, female and intersex population shares
# National totals, summed over all counties
gender_totals <- c(Male = sum(pop_data$Male_Population),
                   Female = sum(pop_data$Female_Population),
                   Intersex = sum(pop_data$Intersex_Population))
plot_ly(labels = names(gender_totals),
        values = unname(gender_totals),
        type = 'pie',
        marker = list(colors = c("blue", "pink", "red"))) %>%
  layout(title = "Population Distribution by Gender")
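The pie chart shows shares rather than counts, so it can help to compute the percentages directly. Every column is averaged over the same 47 counties, so the ratio of two column means in the `summary()` output equals the ratio of the corresponding totals. The sketch below uses those printed means, which are rounded as displayed, so the results are approximate. `col_means` and `total_mean` are hypothetical names.

```r
# Column means as printed by summary(pop_data)
col_means <- c(Male = 501023, Female = 510951, Intersex = 32.43)

# Mean of Total_Population from the same output
total_mean <- 1012006

# Percentage share of each group in the total population
shares_pct <- 100 * col_means / total_mean

round(shares_pct, 4)
```

This gives roughly 49.51% male, 50.49% female and 0.0032% intersex, in line with the intersex share quoted in the findings.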
Total Households by Region
# Bar Plot of Total Households by Region
plot_ly(data = pop_data,
x = ~Region,
y = ~Total_Household,
type = 'bar',
name = 'Total Households by Region',
marker = list(color = 'blue')) %>%
layout(title = 'Total Households by Region',
xaxis = list(title = 'Region'),
yaxis = list(title = 'Total Households for Region'))
OBSERVATION:
- Rift Valley has the highest number of households, while North Eastern
has the lowest.
Comparison of Group Quarters and Conventional Household
# Pie chart of conventional households vs. group quarters
# National totals, summed over all counties
household_totals <- c(Conventional_Household = sum(pop_data$Conventional_Household),
                      Group_Quarters = sum(pop_data$Group_Quarters))
plot_ly(labels = names(household_totals),
        values = unname(household_totals),
        type = 'pie',
        marker = list(colors = c("blue", "red"))) %>%
  layout(title = "Comparison of Group Quarters and Conventional Household")
OBSERVATION:
- Conventional households are far more common than group quarters.
Comparison of Group Quarters and Conventional Household by Region
# Grouped Bar plot showing Comparison of Group Quarters and Conventional Household by Region
plot_ly(data = pop_data,
x = ~Region,
y = ~Group_Quarters,
type = 'bar',
name = 'Group Quarters',
marker = list(color = 'red')) %>%
add_trace(y = ~Conventional_Household,
name = 'Conventional Household',
marker = list(color = 'blue')) %>%
layout(title = 'Comparison of Group Quarters and Conventional Household by Region',
xaxis = list(title = 'Region'),
yaxis = list(title = 'Count'),
barmode = 'group')
OBSERVATIONS:
- All Kenyan regions have both conventional households and group quarters.
- Conventional households are more common than group quarters.
- Rift Valley has the highest number of both conventional households and
group quarters.
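Absolute counts hide how much the share of group quarters varies between counties. The sketch below computes that share for the five rows printed by `str()`, with the fifth county taken to be LAMU because its density matches the Coast table. `sample_rows` and `gq_share_pct` are hypothetical names.

```r
# Five rows as printed by str(pop_data); fifth county assumed to be LAMU
sample_rows <- data.frame(
  County          = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU"),
  Total_Household = c(378422, 173176, 298472, 68242, 37963),
  Group_Quarters  = c(2127, 374, 482, 1258, 3732)
)

# Group quarters as a percentage of all households in each county
sample_rows$gq_share_pct <-
  100 * sample_rows$Group_Quarters / sample_rows$Total_Household

# Counties ordered from highest to lowest share
sample_rows[order(-sample_rows$gq_share_pct), c("County", "gq_share_pct")]
```

In this sample the share stays below 10% in every county, which is consistent with conventional households dominating, although it differs noticeably from county to county.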
Household vs. Land Area
# Scatter Plot of Household vs. Land Area
plot_ly(
data = pop_data,
x = ~Total_Household,
y = ~Land_Area_SqKm,
color = ~Region, # Specify the variable for coloring
type = 'scatter',
mode = 'markers'
) %>%
layout(
title = 'Household vs. Land Area by Region',
xaxis = list(title = 'Total Household'),
yaxis = list(title = 'Land Area (Sq Km)')
)
OBSERVATION:
- Most counties combine a relatively small land area with a relatively
low number of households.
Population Density Distribution
# Histogram of Population Density
# Calculate density values
density_values <- density(pop_data$Density_Persons_per_SqKm)
# Create the combined plot
combined_plot <- plot_ly() %>%
add_histogram(x = ~pop_data$Density_Persons_per_SqKm, nbinsx = 20, histnorm = 'probability density', marker = list(color = 'blue', line = list(color = 'black', width = 1))) %>%
add_lines(x = density_values$x, y = density_values$y, type = 'scatter', mode = 'lines', line = list(color = 'red', width = 2)) %>%
layout(title = 'Population Density Histogram with KDE',
xaxis = list(title = 'Density (Persons per Sq Km)'),
yaxis = list(title = 'Probability Density'))
# Display the combined plot
combined_plot
OBSERVATION:
- Most Kenyan counties have a relatively low population density, with a
few exceptions where the density is far higher.
Comparing densities across regions will give more insight.
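The right skew described here can also be read off the `summary()` output without a plot, since a mean far above the median points to a long upper tail. The sketch below uses the summary values printed earlier for Density_Persons_per_SqKm. `dens` is a hypothetical name.

```r
# Summary statistics of Density_Persons_per_SqKm as printed by summary()
dens <- c(min = 6.481, q1 = 52.826, median = 220.377,
          mean = 509.163, q3 = 415.876, max = 6246.995)

# Ratio of mean to median; values well above 1 suggest right skew
mean_to_median <- dens[["mean"]] / dens[["median"]]

# A mean above the third quartile means a few large values dominate it
mean_above_q3 <- dens[["mean"]] > dens[["q3"]]

mean_to_median
mean_above_q3
```

Here the mean is more than twice the median and lies above the third quartile, so a small number of very dense counties pull the average up.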
Population Density by Region
# Box Plot
plot_ly(data = pop_data, x = ~Region, y = ~Density_Persons_per_SqKm, type = "box") %>%
layout(title = "Population Density by Region",
xaxis = list(title = "Region"),
yaxis = list(title = "Density (Persons per Sq Km)"))
OBSERVATIONS:
- Most regions have relatively low densities, ranging from near 0 to
slightly above 1,000 persons per sq km, whereas the Nairobi region has a
very high density. The Coast region has one outlier with a very high
density of about 5,495 persons per sq km.
- In most regions the median lies close to the lower end of the range,
meaning that most counties have low densities. This reinforces the low
population density seen in most parts of Kenya.
A table of the density of each county in the Coast region helps
identify the outlier.
# Filter Coast data only
coast_data <- pop_data[pop_data$Region == "Coast", ]
# Create a table of density values
coast_density_table <- coast_data %>%
select(County, Density_Persons_per_SqKm) %>%
kable("html", caption = "Density Values for Counties in the Coast Region") %>%
kable_styling()
# Display the table
coast_density_table
Density Values for Counties in the Coast Region
| County | Density_Persons_per_SqKm |
|---|---|
| MOMBASA | 5495.00838 |
| KWALE | 105.02256 |
| KILIFI | 115.80947 |
| TANA RIVER | 8.33543 |
| LAMU | 22.90616 |
| TAITA/TAVETA | 19.86188 |
OBSERVATION:
- In the Coast region, Mombasa is the outlier, with a very high density
compared to the other counties.
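Mombasa's outlier status can be checked numerically with the common 1.5 × IQR rule that box plots are usually based on. The sketch below uses the six Coast densities from the table above. plotly may compute quartiles with a slightly different method than R's default `quantile()`, so the exact fence values may not match the plot. `coast`, `upper_fence` and `lower_fence` are hypothetical names.

```r
# Densities of the six Coast counties, as shown in the table
coast <- data.frame(
  County = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU", "TAITA/TAVETA"),
  Density_Persons_per_SqKm = c(5495.00838, 105.02256, 115.80947,
                               8.33543, 22.90616, 19.86188)
)

# First and third quartiles (R's default quantile method)
q <- quantile(coast$Density_Persons_per_SqKm, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences of the 1.5 * IQR rule
upper_fence <- q[[2]] + 1.5 * iqr
lower_fence <- q[[1]] - 1.5 * iqr

# Counties whose density falls outside the fences
outliers <- coast$County[coast$Density_Persons_per_SqKm > upper_fence |
                         coast$Density_Persons_per_SqKm < lower_fence]
outliers
```

Only Mombasa falls outside the fences, which agrees with the observation drawn from the table.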
FINDINGS
- The Rift Valley region has the highest population and the highest
number of households.
- The North Eastern region has the lowest population and the lowest
number of households.
- Most Kenyans identify as male or female, with slightly more females
than males. A very small percentage (0.0032%) identify as intersex.
- Nairobi County has the highest population of all counties, followed by
Kiambu and then Nakuru.
- Lamu County has the lowest population of all counties, followed by
Isiolo and then Samburu.
- All Kenyan regions have both conventional households and group
quarters, but conventional households are far more common.
- Most Kenyan counties have a relatively low population density (persons
per square kilometre). The exceptions are Mombasa and Nairobi, whose
densities are far higher than those of the other counties.
