About the Data

The data comes from the Kenya National Bureau of Statistics (KNBS) data tables.

It covers the distribution of population and households by county, according to the 2019 Kenya Population and Housing Census.

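It contains the following fields (named here as they are after the renaming step in this analysis):

  • County
  • Total_Population
  • Male_Population
  • Female_Population
  • Intersex_Population
  • Total_Household
  • Conventional_Household
  • Group_Quarters
  • Land_Area_SqKm
  • Density_Persons_per_SqKm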

Steps Taken

  1. Importing Necessary R Packages
  2. Loading Data to R
  3. Understanding the Data
  4. Data Cleaning and Preparation - Missing Values
  5. Exploratory Data Analysis - Visualizations
  6. Findings

Importing necessary packages

library(readxl) # To read excel data into R
library(tidyverse) # To tidy data
library(plotly) # To create interactive visualizations
library(knitr) # For report generation
library(kableExtra) # To display table
library(dplyr) # For data analysis

Load and Preview the Data

# Load data to a dataFrame
pop_data <- read_excel("C:/Users/KNBS/Documents/PROJECTS/Education Loans R Shiny/Population-households-density-by-county.xlsx", skip = 4)

# Preview the Data
head(pop_data)

Formatting Columns

Renaming column Headers

colnames(pop_data) <- c("County", "Total_Population", "Male_Population",
                        "Female_Population", "Intersex_Population",
                        "Total_Household", "Conventional_Household",
                        "Group_Quarters", "Land_Area_SqKm",
                        "Density_Persons_per_SqKm")
# Previewing the result
head(pop_data)
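
Renaming by position assumes the sheet has exactly ten columns in the expected order. A minimal sketch of a guard for that assumption, using a placeholder data frame (toy) rather than the real pop_data:

```r
# The ten column names used in this analysis
new_names <- c("County", "Total_Population", "Male_Population",
               "Female_Population", "Intersex_Population",
               "Total_Household", "Conventional_Household",
               "Group_Quarters", "Land_Area_SqKm",
               "Density_Persons_per_SqKm")

# Placeholder data frame with ten columns (stands in for pop_data)
toy <- as.data.frame(matrix(0, nrow = 2, ncol = 10))

# Stop early if the column count does not match the names supplied
stopifnot(ncol(toy) == length(new_names))
colnames(toy) <- new_names
```

If the source workbook ever gains or loses a column, the stopifnot() call fails before any column is mislabelled.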

Understanding the structure of the Data

str(pop_data)
tibble [48 × 10] (S3: tbl_df/tbl/data.frame)
 $ County                  : chr [1:48] "MOMBASA" "KWALE" "KILIFI" "TANA RIVER" ...
 $ Total_Population        : num [1:48] 1208333 866820 1453787 315943 143920 ...
 $ Male_Population         : num [1:48] 610257 425121 704089 158550 76103 ...
 $ Female_Population       : num [1:48] 598046 441681 749673 157391 67813 ...
 $ Intersex_Population     : num [1:48] 30 18 25 2 4 7 34 49 37 18 ...
 $ Total_Household         : num [1:48] 378422 173176 298472 68242 37963 ...
 $ Conventional_Household  : num [1:48] 376295 172802 297990 66984 34231 ...
 $ Group_Quarters          : num [1:48] 2127 374 482 1258 3732 ...
 $ Land_Area_SqKm          : num [1:48] 220 8254 12553 37904 6283 ...
 $ Density_Persons_per_SqKm: num [1:48] 5495.01 105.02 115.81 8.34 22.91 ...
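OBSERVATIONS:
  • The data has 48 rows and 10 columns.
  • County is a character column; the other nine columns are numeric.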

Missing Values

Checking for rows with any null/Missing Values and viewing them

# Check for NAs in the entire pop_data data frame
na_rows <- pop_data[apply(is.na(pop_data), 1, any), ]

# View the rows with NAs in any column
print(na_rows)

OBSERVATION: Only one row has missing values. It is the last row, which is not part of the data, so it will be dropped.

pop_data <- na.omit(pop_data)

Checking for missing values again to confirm the row was removed

# Check for NAs in the entire pop_data data frame
na_rows <- pop_data[apply(is.na(pop_data), 1, any), ]

# View the rows with NAs in any column (expected: none)
print(na_rows)

OBSERVATION: There are no missing values left.
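
A per-column count can complement the row-level check above. The sketch below uses a small made-up data frame (toy), not the census data, to show colSums(is.na()) and complete.cases() from base R:

```r
# Toy data frame standing in for pop_data (values are illustrative only)
toy <- data.frame(
  County = c("A", "B", NA),
  Total_Population = c(100, 200, NA),
  Land_Area_SqKm = c(10, 20, 30)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows containing at least one missing value
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)
```

Applied to pop_data, the same two calls would show which columns the incomplete row affects and confirm the count of incomplete rows.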

Getting Summary statistics of the data fields

# Summary statistics of the Data's columns
summary(pop_data)
    County          Total_Population  Male_Population   Female_Population Intersex_Population Total_Household   Conventional_Household Group_Quarters   
 Length:47          Min.   : 143920   Min.   :  76103   Min.   :  67813   Min.   :  2.00      Min.   :  37963   Min.   :  34231        Min.   :   77.0  
 Class :character   1st Qu.: 609505   1st Qu.: 303110   1st Qu.: 311291   1st Qu.: 18.00      1st Qu.: 141956   1st Qu.: 140409        1st Qu.:  610.5  
 Mode  :character   Median : 893681   Median : 450741   Median : 448868   Median : 25.00      Median : 204188   Median : 203576        Median : 1258.0  
                    Mean   :1012006   Mean   : 501023   Mean   : 510951   Mean   : 32.43      Mean   : 258381   Mean   : 256234        Mean   : 2146.7  
                    3rd Qu.:1156724   3rd Qu.: 569992   3rd Qu.: 589759   3rd Qu.: 34.00      3rd Qu.: 302844   3rd Qu.: 299550        3rd Qu.: 2681.0  
                    Max.   :4397073   Max.   :2192452   Max.   :2204376   Max.   :245.00      Max.   :1506888   Max.   :1494676        Max.   :17809.0  
 Land_Area_SqKm    Density_Persons_per_SqKm
 Min.   :  219.9   Min.   :   6.481        
 1st Qu.: 2526.2   1st Qu.:  52.826        
 Median : 3325.0   Median : 220.377        
 Mean   :12359.5   Mean   : 509.163        
 3rd Qu.:14852.6   3rd Qu.: 415.876        
 Max.   :70944.3   Max.   :6246.995        
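
Before analysis, it may be worth confirming that the component columns add up to the reported totals. The sketch below is a suggested check, not part of the original workflow; it uses the first five rows as shown in the str() output above, and the data frame name check is a placeholder:

```r
# First five rows of pop_data, copied from the str() output above
check <- data.frame(
  County = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU"),
  Total_Population = c(1208333, 866820, 1453787, 315943, 143920),
  Male_Population = c(610257, 425121, 704089, 158550, 76103),
  Female_Population = c(598046, 441681, 749673, 157391, 67813),
  Intersex_Population = c(30, 18, 25, 2, 4),
  Total_Household = c(378422, 173176, 298472, 68242, 37963),
  Conventional_Household = c(376295, 172802, 297990, 66984, 34231),
  Group_Quarters = c(2127, 374, 482, 1258, 3732)
)

# TRUE where the sex breakdown adds up to the reported total
sex_ok <- with(check,
               Male_Population + Female_Population + Intersex_Population == Total_Population)

# TRUE where the household types add up to the reported total
household_ok <- with(check,
                     Conventional_Household + Group_Quarters == Total_Household)

# Counties failing either check
failed <- check$County[!(sex_ok & household_ok)]
print(failed)
```

Running the same comparisons on the full pop_data would flag any county whose totals do not reconcile.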

Adding an extra column called Region for better analysis

# Create 'Region' column based on 'County'.
# Rules are checked in order and the first match wins; counties matching no rule become "Other".
pop_data$Region <- case_when(
  pop_data$County %in% c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU", "TAITA/TAVETA") ~ "Coast",
  pop_data$County %in% "NAIROBI CITY" ~ "Nairobi",
  pop_data$County %in% c("KIAMBU", "MURANG'A", "NYERI", "NYANDARUA", "KIRINYAGA") ~ "Central",
  pop_data$County %in% c("MANDERA", "WAJIR", "GARISSA", "MARSABIT") ~ "North Eastern",
  pop_data$County %in% c("TURKANA", "UASIN GISHU", "ELGEYO/MARAKWET", "KERICHO", "WEST POKOT",
                         "SAMBURU", "TRANS NZOIA", "BARINGO", "NANDI", "LAIKIPIA", "NAKURU",
                         "NAROK", "KAJIADO", "BOMET") ~ "Rift Valley",
  pop_data$County %in% c("KAKAMEGA", "VIHIGA", "BUNGOMA", "BUSIA") ~ "Western",
  pop_data$County %in% c("KISII", "NYAMIRA", "HOMA BAY", "MIGORI", "KISUMU", "SIAYA") ~ "Nyanza",
  pop_data$County %in% c("THARAKA-NITHI", "EMBU", "KITUI", "MAKUENI", "MERU", "ISIOLO", "MACHAKOS") ~ "Eastern",
  TRUE ~ "Other"
)
# Reorder columns with 'Region' as the first column
pop_data <- pop_data %>%
  select(Region, everything())

# Check the resulting data frame
head(pop_data)
# Check unique values in the 'Region' column
unique(pop_data$Region)
[1] "Coast"         "North Eastern" "Eastern"       "Central"       "Rift Valley"   "Western"       "Nyanza"        "Nairobi"      
OBSERVATION:

  • Every county is assigned to one of the 8 regions listed above; none falls into "Other".
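
A named vector can serve as a county-to-region lookup and makes unmatched counties easy to spot. This is an alternative sketch, not the method used above; region_lookup and counties are hypothetical names, and only a few counties are listed for illustration:

```r
# Hypothetical lookup table: names are counties, values are regions
region_lookup <- c(
  "MOMBASA" = "Coast",
  "KWALE" = "Coast",
  "NAIROBI CITY" = "Nairobi",
  "KIAMBU" = "Central"
)

# Example county names; BOMET is deliberately missing from the lookup
counties <- c("MOMBASA", "KIAMBU", "NAIROBI CITY", "BOMET")

# Indexing by name returns NA for counties absent from the lookup
regions <- unname(region_lookup[counties])
regions[is.na(regions)] <- "Other"

# Counties that fell through to "Other" and still need a mapping
unmatched <- counties[regions == "Other"]
print(unmatched)
```

With the full mapping in place, an empty unmatched vector confirms that no county was left as "Other".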

EXPLORATORY DATA ANALYSIS

VISUALIZATIONS

Total Population by Region

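# Bar Plot of Total Population by Region
# Sum county populations so each region is drawn as a single bar
region_pop <- pop_data %>%
  group_by(Region) %>%
  summarise(Total_Population = sum(Total_Population))

plot_ly(data = region_pop,
        x = ~Region,
        y = ~Total_Population,
        type = 'bar',
        name = 'Total Population',
        marker = list(color = 'blue')) %>%
  layout(title = 'Total Population by Region',
         xaxis = list(title = 'Region'),
         yaxis = list(title = 'Total Population'))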
OBSERVATIONS:
  • The Rift Valley region has the highest population.
  • The North Eastern region has the lowest population.

Top 10 Highest County Populations

#Barplot of Top 10 Highest County Populations

# Order the data by Total_Population in descending order
Total_Population_desc <- pop_data[order(pop_data$Total_Population, decreasing = TRUE), ]

# Select the top N counties (e.g., top 10)
top_counties <- head(Total_Population_desc, n = 10)

plot_ly(data = top_counties,
        x = ~reorder(County, -Total_Population),
        y = ~Total_Population,
        type = 'bar',
        name = 'Total Population',
        marker = list(color = 'blue')) %>%
    layout(title = 'Top 10 County Populations',
         xaxis = list(title = 'County'),
         yaxis = list(title = 'Total Population'))
OBSERVATION:
  • Nairobi County has the highest population of all counties, followed by Kiambu and then Nakuru.

Bottom 10 (Low) County Populations

# Barplot of Bottom 10 County Populations

# Order the data by Total_Population in ascending order
pop_data_sorted <- pop_data[order(pop_data$Total_Population), ]

# Select the 10 least populous counties
bottom_counties <- head(pop_data_sorted, n = 10)

# Create the plot with ordered bars in ascending order
plot_ly(data = bottom_counties,
        x = ~reorder(County, Total_Population),
        y = ~Total_Population,
        type = 'bar',
        name = 'Total Population',
        marker = list(color = 'blue')) %>%
  
  layout(title = 'Bottom 10 County Populations',
         xaxis = list(title = 'County'),
         yaxis = list(title = 'Total Population'))
OBSERVATION:
  • Lamu County has the lowest population of all counties, followed by Isiolo and then Samburu.

Population by Gender

# Pie Chart of Male, Female and Intersex Population percentages
pop_data %>%
  plot_ly(labels = ~c("Male", "Female", "Intersex"),
          values = ~c(sum(pop_data$Male_Population), sum(pop_data$Female_Population), sum(pop_data$Intersex_Population)),
          type = 'pie',
          marker = list(colors = c("blue", "pink", "red"))) %>%
  layout(title = "Population Distribution by Gender")
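
The percentages behind a pie chart like this can also be computed directly. The sketch below uses only the five counties whose values appear in the str() output above, so the shares are for that subset rather than for Kenya as a whole; totals and shares are placeholder names:

```r
# Column totals for the first five counties shown in the str() output;
# for national shares, sum the same columns over the full pop_data instead
totals <- c(
  Male = sum(c(610257, 425121, 704089, 158550, 76103)),
  Female = sum(c(598046, 441681, 749673, 157391, 67813)),
  Intersex = sum(c(30, 18, 25, 2, 4))
)

# Convert the totals to percentages of the combined population
shares <- round(100 * prop.table(totals), 3)
print(shares)
```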

Total Households by Region

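# Bar Plot of Total Households by Region
# Sum county households so each region is drawn as a single bar
region_hh_total <- pop_data %>%
  group_by(Region) %>%
  summarise(Total_Household = sum(Total_Household))

plot_ly(data = region_hh_total,
        x = ~Region,
        y = ~Total_Household,
        type = 'bar',
        name = 'Total Households by Region',
        marker = list(color = 'blue')) %>%
    layout(title = 'Total Households by Region',
         xaxis = list(title = 'Region'),
         yaxis = list(title = 'Total Households for Region'))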
OBSERVATION:
  • Rift Valley has the highest number of households, while North Eastern has the fewest.

Comparison of Group Quarters and Conventional Household

# Pie Chart of Conventional Household and Group Quarters shares

pop_data %>%
  plot_ly(labels = ~c("Conventional_Household", "Group_Quarters"),
          values = ~c(sum(pop_data$Conventional_Household), sum(pop_data$Group_Quarters)),
          type = 'pie',
          marker = list(colors = c("blue", "red"))) %>%
  layout(title = "Comparison of Group Quarters and Conventional Household")
OBSERVATION:
  • Conventional households are more common than Group Quarters.

Comparison of Group Quarters and Conventional Household by Region

# Grouped Bar plot showing Comparison of Group Quarters and Conventional Household by Region
# Sum both household types within each region so each region gets one bar per type
region_hh_types <- pop_data %>%
  group_by(Region) %>%
  summarise(Group_Quarters = sum(Group_Quarters),
            Conventional_Household = sum(Conventional_Household))

plot_ly(data = region_hh_types,
        x = ~Region,
        y = ~Group_Quarters,
        type = 'bar',
        name = 'Group Quarters',
        marker = list(color = 'red')) %>%
  add_trace(y = ~Conventional_Household,
            name = 'Conventional Household',
            marker = list(color = 'blue')) %>%
  layout(title = 'Comparison of Group Quarters and Conventional Household by Region',
         xaxis = list(title = 'Region'),
         yaxis = list(title = 'Count'),
         barmode = 'group')
OBSERVATIONS:
  • All Kenyan regions have both Conventional Households and Group Quarters.
  • Conventional Households are more common than Group Quarters.
  • Rift Valley has the highest number of both Conventional Households and Group Quarters.

Household vs. Land Area

# Scatter Plot of Household vs. Land Area

plot_ly(
  data = pop_data,
  x = ~Total_Household,
  y = ~Land_Area_SqKm,
  color = ~Region,  # Specify the variable for coloring
  type = 'scatter',
  mode = 'markers'
) %>%
  layout(
    title = 'Household vs. Land Area by Region',
    xaxis = list(title = 'Total Household'),
    yaxis = list(title = 'Land Area (Sq Km)')
  )
OBSERVATION:
  • Most counties combine a small land area with a low household count.

Population Density Distribution

# Histogram of Population Density
# Calculate density values
density_values <- density(pop_data$Density_Persons_per_SqKm)

# Create the combined plot
combined_plot <- plot_ly() %>%
  add_histogram(x = ~pop_data$Density_Persons_per_SqKm, nbinsx = 20, histnorm = 'probability density', marker = list(color = 'blue', line = list(color = 'black', width = 1))) %>%
  add_lines(x = density_values$x, y = density_values$y, type = 'scatter', mode = 'lines', line = list(color = 'red', width = 2)) %>%
  layout(title = 'Population Density Histogram with KDE',
         xaxis = list(title = 'Density (Persons per Sq Km)'),
         yaxis = list(title = 'Probability Density'))

# Display the combined plot
combined_plot
OBSERVATION:
  • Most Kenyan counties have a relatively low population density, with a few exceptions where density is far higher. The summary statistics agree: the median density (about 220 persons per sq km) is less than half the mean (about 509).

Comparing densities across regions should give more insight.

Population Density by Region

# Box Plot
plot_ly(data = pop_data, x = ~Region, y = ~Density_Persons_per_SqKm, type = "box") %>%
  layout(title = "Population Density by Region",
         xaxis = list(title = "Region"),
         yaxis = list(title = "Density (Persons per Sq Km)"))
OBSERVATION:
  • Most regions have relatively low densities, ranging from near 0 to slightly above 1,000 persons per sq km, but the Nairobi region has a very high density. The Coast region has one outlier with a very high density of about 5,495 persons per sq km.
  • In most regions, the median sits close to the lower values, meaning most counties fall in the lower range. This underlines the low population density in most parts of Kenya.

A table of density for each county in the Coast region will help explain the outlier.

# Filter Coast region data only
coast_data <- pop_data[pop_data$Region == "Coast", ]

# Create a table of density values
coast_density_table <- coast_data %>%
  select(County, Density_Persons_per_SqKm) %>%
  kable("html", caption = "Density Values for Counties in the Coast Region") %>%
  kable_styling()

# Display the table
coast_density_table
Density Values for Counties in the Coast Region

County          Density_Persons_per_SqKm
MOMBASA         5495.00838
KWALE            105.02256
KILIFI           115.80947
TANA RIVER         8.33543
LAMU              22.90616
TAITA/TAVETA      19.86188
OBSERVATION:
  • In the Coast region, Mombasa is the outlier, with a far higher density than the other five counties.
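
The outlier can also be confirmed numerically with the usual 1.5 × IQR box-plot rule. The sketch below applies that rule to the six Coast densities from the table above, using R's default quantile method; the box plot's own quartile calculation may differ slightly, so treat the fence value as approximate:

```r
# Coast region densities, copied from the table above
coast <- data.frame(
  County = c("MOMBASA", "KWALE", "KILIFI", "TANA RIVER", "LAMU", "TAITA/TAVETA"),
  Density = c(5495.00838, 105.02256, 115.80947, 8.33543, 22.90616, 19.86188)
)

# First and third quartiles of density
q <- quantile(coast$Density, c(0.25, 0.75))

# Upper fence: third quartile plus 1.5 times the interquartile range
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Counties whose density lies above the upper fence
outliers <- coast$County[coast$Density > upper_fence]
print(outliers)
```

Only Mombasa lies above the fence; the next highest Coast density (Kilifi) is well below it.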

FINDINGS

