Author

D K. Muriithi | CDAM-Chuka University

Published

May 4, 2026

Session 1: Introduction to R for Data Science

1.1 What is R?

R is an open-source programming language for statistical computing and visualization. It is widely used in data science, research, and analytics.

R is a programming language designed for:

  • Statistical analysis

  • Data visualization

  • Data manipulation

  • Reproducible research

It is widely used in:

  • Academic research

  • Data science and machine learning

  • Business analytics

Key Characteristics

  • Open-source (free to use)

  • Strong statistical capabilities

  • Thousands of packages (e.g., dplyr, ggplot2)

  • Supports reproducible research (via R Markdown / Quarto)

1.2 What is RStudio?

RStudio is an Integrated Development Environment (IDE) for R. It makes working with R easier and more organized.

Main Components of RStudio

When you open RStudio, you will see four panels:

  1. Source (Script Editor)
  • Write and save your code

  • Recommended for all serious work

  2. Console
  • Execute R commands directly

  • Immediate output

  3. Environment/History
  • Shows stored variables and datasets

  • Tracks command history

  4. Files/Plots/Packages/Help
  • Files: navigate directories

  • Plots: display graphs

  • Packages: install/load libraries

  • Help: documentation

1.3 Installing R

Step 1: Download R

Go to the official repository:
👉 https://cran.r-project.org/

Choose your operating system:

  • Windows

  • macOS

  • Linux

Step 2: Install R (Windows example)

  1. Click Download R for Windows

  2. Click base

  3. Download the latest version

  4. Run the .exe file

  5. Follow installation prompts (default settings are fine)

Step 3: Verify Installation

Open R (or RStudio later) and run:

R.version.string        # prints the installed R version

If R is installed correctly, version details will appear.

1.4 Installing RStudio

Step 1: Download RStudio

Go to:
👉 https://posit.co/download/rstudio-desktop/

Choose:

  • RStudio Desktop (Free version)

Step 2: Install

  1. Download installer

  2. Run setup file

  3. Follow installation steps

Step 3: Launch RStudio

  • Open RStudio

  • It should automatically detect your R installation

1.5 What is Data Science?

Data science is an interdisciplinary field that combines statistics, computational tools and domain knowledge to extract meaningful insights and knowledge from data.

1.5.1 Key Components of Data Science

Statistics and Mathematics

  • Foundation for inference and modeling

  • Examples: hypothesis testing, regression, probability

Programming and Computing

  • Tools to manipulate and analyze data

  • Common languages: R, Python

Domain Knowledge

  • Understanding the problem context (e.g., healthcare, finance)

  • Ensures results are meaningful and actionable

1.5.2 Why R for Data Science?

  • Strong statistical foundation

  • Rich ecosystem (dplyr, ggplot2, caret)

  • Ideal for research and modeling

1.5.3 Core Idea

At its core, data science answers questions like:

  • What is happening? (descriptive analysis)

  • Why is it happening? (diagnostic analysis)

  • What will happen next? (predictive modeling)

  • What should we do? (prescriptive analytics)

1.5.4 Data Science Workflow (Step-by-Step)

Step 1: Problem Definition

  • Clearly define the research or business question

Step 2: Data Collection

  • Gather data from sources (databases, APIs, surveys, sensors)

Step 3: Data Cleaning

  • Handle missing values, errors, inconsistencies

Step 4: Exploratory Data Analysis (EDA)

  • Summarize and visualize data

  • Identify patterns, trends, outliers

Step 5: Modeling

  • Apply statistical or machine learning models

  • Examples: regression, classification, clustering

Step 6: Evaluation

  • Assess model performance

  • Metrics: accuracy, RMSE, AUC

Step 7: Communication

  • Present findings using reports, dashboards, or visualizations

1.5.5 Types of Data Science Tasks

  • Classification → Predict categories (e.g., disease vs no disease)

  • Regression → Predict continuous values (e.g., price, temperature)

  • Clustering → Group similar observations

  • Time Series Analysis → Analyze data over time

  • Natural Language Processing (NLP) → Work with text data

1.5.6 Applications of Data Science

Healthcare

  • Disease prediction

  • Patient risk modeling

Finance

  • Fraud detection

  • Credit scoring

Business

  • Customer segmentation

  • Sales forecasting

Agriculture

  • Crop prediction

  • Pest/disease detection

Social Media

  • Sentiment analysis

  • Recommendation systems

1.5.7 Tools Used in Data Science

  • Programming: R, Python

  • Visualization: ggplot2, Tableau, Power BI

  • Databases: SQL

  • Machine Learning: caret, scikit-learn (in Python), XGBoost

1.5.8 Simple Example

Problem: Predict house prices

Steps:

  1. Collect data (size, location, price)

  2. Clean dataset

  3. Explore relationships

  4. Fit regression model

  5. Evaluate model accuracy

  6. Predict new house prices

1.5.9 Summary

Data science combines:

  • Statistics (to reason about data)

  • Programming (to process data)

  • Domain knowledge (to interpret results)

Its goal is to transform raw data into useful insights and decisions.

1.6 Getting Started

Step 1: Open RStudio

You will see the RStudio interface with the four panels described in Section 1.2.

Creating an R script

Click File → New File → R Script

The script editor opens with an empty script

Run Your First Command

In the Console:

# Example 1: Print a message
message("Hello, R!")
Hello, R!
print("I am a boy"); print("And i go to school"); print("And i love sharing")
[1] "I am a boy"
[1] "And i go to school"
[1] "And i love sharing"

Create a Script

  1. Click File → New File → R Script

  2. Type:

x <- 10
y <- 5
x + y
[1] 15

Step 2: Basic Commands

# Arithmetic
2 + 2      # addition
[1] 4
5 * 3      # multiplication  
[1] 15
10 / 2     # division 
[1] 5
sqrt(16)   # square root
[1] 4
log(10)    # natural logarithm 
[1] 2.302585

Assigning Variables

x <- 10
y <- 5
x + y
[1] 15

Data Types (Core Structures)

Vectors

A vector is the most basic data structure in R. It is a one-dimensional collection of elements where all elements must be of the same data type.

Think of it as a single column of data.

# Create a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
print(numeric_vector)
[1] 1 2 3 4 5
# Create a character vector
character_vector <- c("apple", "banana", "cherry")
print(character_vector)
[1] "apple"  "banana" "cherry"
# Access elements of a vector
print(numeric_vector[1])       # Output: 1
[1] 1
print(numeric_vector[2:4])     # Second to fourth elements
[1] 2 3 4
print(character_vector[2])          # Output: "banana"
[1] "banana"
# Modify a vector
numeric_vector[3] <- 10        # Change the third element to 10
print(numeric_vector)
[1]  1  2 10  4  5
# Vector operations
sum_vector <- numeric_vector + c(1, 1, 1, 1, 1)  # Add 1 to each element
print(sum_vector)
[1]  2  3 11  5  6
# Vector arithmetic
vector_a <- c(1, 2, 3)
vector_b <- c(4, 5, 6)
result <- vector_a + vector_b
print(result)  # Output: 5 7 9
[1] 5 7 9
# Logical vector
flags <- c(TRUE, FALSE, TRUE)

Matrices

A matrix in R is a two-dimensional data structure that stores data in rows and columns, where all elements must be of the same data type.

Think of it as a table of numbers used mainly for mathematical and statistical computations.

# Create a matrix
matrix_data <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(matrix_data)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# Access elements of a matrix
print(matrix_data[1, 2])       # Element in the first row, second column
[1] 3
print(matrix_data[, 3])        # Third column
[1] 5 6
print(matrix_data[2, ])        # Second row
[1] 2 4 6
# Matrix operations
matrix_transpose <- t(matrix_data)  # Transpose the matrix
print(matrix_transpose)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
# Matrix multiplication
matrix_a <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
matrix_b <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2)
result <- matrix_a %*% matrix_b
print(result)  # matrix() fills column-wise, so the result is [[23, 31], [34, 46]]
     [,1] [,2]
[1,]   23   31
[2,]   34   46

Lists

A list is a flexible data structure that can store different types of objects together in one container.

# Example 1: Create a list
my_list <- list(name = "Alice", age = 25, scores = c(85, 90, 95))
print(my_list)
$name
[1] "Alice"

$age
[1] 25

$scores
[1] 85 90 95
# Access elements of a list
print(my_list$name)           # Access the "name" element
[1] "Alice"
# Modify a list
my_list$scores <- c(80, 85, 90)  # Update the "scores" element
print(my_list)
$name
[1] "Alice"

$age
[1] 25

$scores
[1] 80 85 90
# Example 2: Create a list
my_list <- list(name = "John", age = 30, hobbies = c("reading", "coding"))
print(my_list)
$name
[1] "John"

$age
[1] 30

$hobbies
[1] "reading" "coding" 
# Example 3: Access elements of a list
print(my_list$name)       # Output: "John"
[1] "John"
print(my_list$hobbies[1]) # Output: "reading"
[1] "reading"

Data Frames:

A data frame in R is a two-dimensional data structure used to store data in tabular form, similar to a spreadsheet or SQL table.

  • Rows → observations (records)

  • Columns → variables (features)

Each column can contain different data types (numeric, character, factor, etc.), but within a column, all values must be of the same type.

# Example 1: Create a data frame
df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                 Age = c(25, 30, 35),
                 Salary = c(50000, 60000, 70000))
print(df)
     Name Age Salary
1   Alice  25  50000
2     Bob  30  60000
3 Charlie  35  70000
# Example 2: Access columns of a data frame
print(df$Name)    # Output: "Alice" "Bob" "Charlie"
[1] "Alice"   "Bob"     "Charlie"
print(df$Age[2])  # Output: 30
[1] 30
# Example 3: Add a new column
df$City <- c("New York", "Los Angeles", "Chicago")
print(df)
     Name Age Salary        City
1   Alice  25  50000    New York
2     Bob  30  60000 Los Angeles
3 Charlie  35  70000     Chicago

Basic Functions

x <- c(1:10, 1:5, NA) 
x
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5 NA
max(x, na.rm = TRUE)        # Find the maximum value in x, excluding missing values
[1] 10
min(x, na.rm = TRUE)        # minimum
[1] 1
mean(x, na.rm = TRUE)       # mean
[1] 4.666667
median(x, na.rm = T)        # median 
[1] 4
sum(x, na.rm = T)           # sum
[1] 70
var(x, na.rm = T)           # variance
[1] 8.095238
sd(x, na.rm = T)            # standard deviation
[1] 2.845213
summary(x)                  
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   2.500   4.000   4.667   6.500  10.000       1 
table(x)                    # Frequency counts of entries
x
 1  2  3  4  5  6  7  8  9 10 
 2  2  2  2  2  1  1  1  1  1 
length(x)                   # length of x 
[1] 16
is.na(x)                    # check if each element in x is missing
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE  TRUE
which(is.na(x))             # the index/location of missing value in the vector x
[1] 16
which(x == 1)               # the index/ location of a particular value in the vector x
[1]  1 11

To save: click File -> Save as

Exercise

  • Create a vector of 10 random numbers

  • Compute mean and standard deviation

set.seed(123)
x <- rnorm(10)
mean(x)
[1] 0.07462564
sd(x)
[1] 0.9537841

Installing and Loading Packages

Packages extend R functionality.

Install a package (run once)

#install.packages("ggplot2")
#install.packages("tidy verse")

Load a package (every session)

library(ggplot2)

Working Directory (Important Concept)

The working directory is where R reads and saves files.

Check current directory

#getwd()

Set directory

#setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")

(In RStudio, you can also use: Session → Set Working Directory)

Common Beginner Mistakes

  • Running code without saving scripts

  • Not setting working directory

  • Forgetting to load packages (library())

  • Confusing assignment <- with =

Core Difference

<- (Assignment Operator)

  • Standard and preferred for assigning values to variables

  • Explicitly indicates assignment

x <- 10

= (Assignment or Argument Matching)

  • Can assign values but also used to pass arguments to functions

  • Context-dependent → can introduce ambiguity

x = 10
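
The difference matters inside function calls. A minimal sketch (assuming a fresh session with no object named x):

# '=' inside a call only names an argument; '<-' also creates an object in the workspace
mean(x = 1:10)    # passes 1:10 to mean()'s argument 'x'; no variable named x is created
mean(x <- 1:10)   # first assigns 1:10 to a new object x, then passes it to mean()
exists("x")       # TRUE, because '<-' created x as a side effect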

Session 2: Data Import, Cleaning and EDA

2.1 Overview

This stage is the foundation of any data science workflow. Poor data quality leads to invalid models and misleading conclusions.

Objectives

  • Import data correctly

  • Clean and prepare datasets

  • Understand structure and quality

  • Explore patterns using statistics and visualization

2.2 Data Importation

Data importation is the first step in any data analysis workflow. In R, base functions such as read.csv() and packages such as readxl are widely used for loading data from various sources, including CSV, Excel, SQL databases, and JSON. Proper data importation ensures that the dataset is structured correctly and is ready for further preprocessing. Common sources include (a short import sketch follows this list):

  • CSV Files: The most common format for storing tabular data, where values are separated by commas.

  • Excel Files: Useful for structured data with multiple sheets; read with readxl::read_excel().

  • JSON and XML Files: Common in web applications and APIs.

  • SQL Databases: Queried directly from R through a database connection.
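
A short sketch of typical import calls (file names are placeholders; the readxl, jsonlite, DBI, and RSQLite packages must be installed separately):

# CSV (base R)
# my_csv <- read.csv("data.csv")

# Excel (readxl package)
# library(readxl)
# my_xlsx <- read_excel("data.xlsx", sheet = 1)

# JSON (jsonlite package)
# library(jsonlite)
# my_json <- fromJSON("data.json")

# SQL (DBI package plus a driver such as RSQLite)
# library(DBI)
# con <- dbConnect(RSQLite::SQLite(), "data.sqlite")
# my_sql <- dbGetQuery(con, "SELECT * FROM my_table")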

2.3. Data Cleaning

Real-world datasets often contain missing values, duplicates, and inconsistencies that must be addressed before analysis. Data cleaning ensures that the dataset is structured correctly and free from errors that might affect the accuracy of the results.

Handling Missing Data

Missing values can occur due to data entry errors, incomplete records, or system issues. There are several ways to handle missing data:

  • Deletion: Removing rows or columns with missing values if they are minimal.

  • Imputation: Replacing missing values with statistical measures such as the mean, median, or mode.

  • Forward or Backward Fill: Filling missing values using previous or next available values in time-series data.
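
A minimal sketch of mean imputation and a simple forward fill on made-up values (packages such as tidyr and zoo provide ready-made fill functions):

x <- c(5, NA, NA, 8, NA, 10)                      # hypothetical values with gaps

# Imputation: replace missing values with the mean of the observed values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

# Forward fill: carry the last observed value forward
x_filled <- x
for (i in 2:length(x_filled)) {
  if (is.na(x_filled[i])) x_filled[i] <- x_filled[i - 1]
}
print(x_filled)                                    # 5 5 5 8 8 10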

Handling Outliers

Outliers are extreme values that differ significantly from the rest of the data and can distort analysis results. They can be detected using statistical methods such as:

  • Z-score method: Identifies data points that are several standard deviations away from the mean.

  • Interquartile Range (IQR) method: Identifies outliers based on quartiles.

  • Visualization methods: Box plots and scatter plots help in detecting extreme values.
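
The boxplot and IQR approaches are demonstrated in Section 2.7; here is a minimal sketch of the Z-score method on made-up values:

x <- c(10, 12, 11, 13, 12, 95)          # 95 is an obvious extreme value
z <- (x - mean(x)) / sd(x)              # standardize each observation
x[abs(z) > 2]                           # flag points more than 2 SDs from the mean (returns 95)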

2.4 Summary Statistics

Summary statistics provide insights into the distribution, central tendency, and spread of the data. Some of the key statistical measures include:

  • Mean: The average value of a dataset.

  • Median: The middle value when the data is sorted.

  • Mode: The most frequently occurring value.

  • Variance and Standard Deviation: Measures the spread of data around the mean.

  • Skewness and Kurtosis: Used to understand the shape of the distribution.

These statistics help analysts understand the nature of the dataset and whether further transformations are necessary before modeling.
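
Mean, median, variance, and standard deviation are computed later in this session. Skewness and kurtosis have no base-R function, so the sketch below applies their formulas directly to made-up values (packages such as e1071 or moments provide ready-made versions):

x <- c(2, 4, 4, 5, 7, 9, 15)                    # hypothetical values
n <- length(x); m <- mean(x); s <- sd(x)
skewness <- sum((x - m)^3) / n / s^3            # > 0 suggests a right-skewed distribution
kurtosis <- sum((x - m)^4) / n / s^4 - 3        # excess kurtosis relative to the normal
c(skewness = skewness, kurtosis = kurtosis)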

2.5 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their key characteristics, often with the help of visualizations. It helps to identify patterns, trends, and potential issues within the data.

Data Visualization

Visualization is a key part of EDA, as it provides an intuitive understanding of data relationships. Common types of visualizations include:

  • Histograms: Show the distribution of numerical data.

  • Box Plots: Identify outliers and the spread of data.

  • Scatter Plots: Show relationships between two numerical variables.

  • Bar Charts: Compare categorical data.

  • Heatmaps: Display correlations between multiple numerical variables.

Using base R graphics and ggplot2, analysts can create these visualizations to better understand the data before applying statistical or machine learning models.
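
Histograms, box plots, scatter plots, and bar charts are covered in Sections 2.7 and 4. As a minimal sketch of a correlation heatmap, using R's built-in mtcars data:

corr <- cor(mtcars[, c("mpg", "hp", "wt", "disp")])        # correlation matrix of four numeric variables
heatmap(corr, symm = TRUE, scale = "none", main = "Correlation Heatmap")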

2.6 Importance of EDA

EDA is a critical step before building models, as it helps in:

  • Understanding the data structure and identifying inconsistencies.

  • Detecting missing values, outliers, and unusual patterns.

  • Summarizing the key characteristics of a dataset.

  • Selecting appropriate features for predictive modeling.

  • Improving data preprocessing and transformation steps.

NB: By the end of this session, learners will be able to import datasets, clean data by handling missing values and outliers, compute summary statistics, and create visualizations for exploratory data analysis.

2.7 Getting started

Data importation

Set directory

setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")

Data importation

# Example 1: Importing an Excel file
# library(readxl)
# data_excel <- read_excel("data.xlsx")
# print(head(data_excel))

# Example 2: Importing a CSV file

gss <- read.csv("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING/GSSsubset.csv")

Examine the dataset

#View(gss)               # Note: Upper case V in View()
#print(head(gss))                # return the first parts of the dataset 

#dim(gss)                 # 1st: row number; 2nd: column number
#nrow(gss)                # number of rows
#ncol(gss)                # number of columns

#colnames(gss)            # name of columns (the variable names) in dataset  
#rownames(gss)            # row name

Handling Missing Data

# Example 1: Identify missing values
missing_values <- sum(is.na(gss))
print(missing_values)
[1] 0
# Example 2: Remove rows with missing values
clean_data <- na.omit(gss)
print(head(clean_data))
  id    sex         degree   income  marital age height weight hrswrk
1  1   MALE       BACHELOR 60967.50 DIVORCED  53     72    190     60
2  2 FEMALE       BACHELOR 60967.50  MARRIED  26     60     97     40
3  4 FEMALE       BACHELOR 10161.25  MARRIED  56     68    160     20
4 14 FEMALE    HIGH SCHOOL 17551.25  MARRIED  40     65    156     37
5 16   MALE    HIGH SCHOOL 17551.25  MARRIED  56     66    210      6
6 19   MALE LT HIGH SCHOOL 15703.75  MARRIED  51     68    170     50
# Example 3: Impute missing values with the mean
gss$age[is.na(gss$age)] <- mean(gss$age, na.rm = TRUE)
#print(head(gss))

Handling Outliers

# Example 1: Detect outliers using boxplot
boxplot(gss$income, main = "Income Distribution")

# Example 2: Remove outliers using IQR
Q1 <- quantile(gss$income, 0.25)
Q3 <- quantile(gss$income, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.2 * IQR   # note: the conventional multiplier is 1.5; 1.2 gives a stricter rule
upper_bound <- Q3 + 1.2 * IQR

data_cleaned <- gss[gss$income >= lower_bound & gss$income <= upper_bound, ]
#print(head(data_cleaned))

Summary Statistics

# Example 1: Summary statistics
summary(gss)
       id             sex               degree              income        
 Min.   :   1.0   Length:994         Length:994         Min.   :   369.5  
 1st Qu.: 648.2   Class :character   Class :character   1st Qu.: 15703.8  
 Median :1254.5   Mode  :character   Mode  :character   Median : 27712.5  
 Mean   :1271.2                                         Mean   : 36887.2  
 3rd Qu.:1915.8                                         3rd Qu.: 49882.5  
 Max.   :2538.0                                         Max.   :158657.0  
   marital               age            height          weight     
 Length:994         Min.   :19.00   Min.   :57.00   Min.   : 90.0  
 Class :character   1st Qu.:33.00   1st Qu.:64.00   1st Qu.:150.0  
 Mode  :character   Median :44.00   Median :67.00   Median :175.0  
                    Mean   :44.49   Mean   :67.41   Mean   :181.3  
                    3rd Qu.:55.00   3rd Qu.:70.00   3rd Qu.:205.0  
                    Max.   :79.00   Max.   :79.00   Max.   :410.0  
     hrswrk     
 Min.   : 1.00  
 1st Qu.:38.00  
 Median :40.00  
 Mean   :42.64  
 3rd Qu.:50.00  
 Max.   :89.00  
# Example 2: Mean, Median, Variance, Standard Deviation
mean_value <- mean(gss$income, na.rm = TRUE)
median_value <- median(gss$income, na.rm = TRUE)
variance_value <- var(gss$income, na.rm = TRUE)
sd_value <- sd(gss$income, na.rm = TRUE)

print(paste("Mean:", mean_value))
[1] "Mean: 36887.2183521127"
print(paste("Median:", median_value))
[1] "Median: 27712.5"
print(paste("Variance:", variance_value))
[1] "Variance: 1204246576.23372"
print(paste("Standard Deviation:", sd_value))
[1] "Standard Deviation: 34702.2560683555"

Data Visualization for EDA

# Example 1: Histogram using base R
hist(gss$income, 
     main = "Income Distribution", 
     xlab = "Income", 
     col = "tomato")

# Example 2: Boxplot using base R
boxplot(gss$income, 
        main = "Income Distribution", 
        ylab = "Income", 
        col = "green")

# Example 3: Scatterplot using ggplot2
library(ggplot2)
ggplot(gss, aes(x = age, y = income)) +
  geom_point(color = "red") +
  labs(title = "Age vs Income", x = "Age", y = "Income") +
  theme_minimal()


# ---------------------------------------
#Homework
# ---------------------------------------
#Import a Dataset :
#Download a dataset (e.g., from Kaggle or UCI Machine Learning Repository ).
#Import the dataset into R using read.csv() or read_excel().
#Clean the Dataset :
#Handle missing values by either removing them or imputing with the mean/median.
#Detect and treat outliers using the IQR method.
#Perform Exploratory Data Analysis (EDA) :
#Calculate summary statistics (mean, median, variance, standard deviation).
#Create visualizations (histograms, boxplots, scatterplots) to explore relationships in the data.

Session 3: Data Manipulation with dplyr

Data manipulation in R refers to the process of cleaning, transforming, organizing, and preparing data so it can be analyzed or visualized effectively.

In simple terms, it means taking raw data and modifying it to get useful information.

Key ideas in data manipulation

  • Cleaning data – removing errors, duplicates, or missing values

  • Sorting & filtering – arranging data or selecting only what you need

  • Transforming data – changing its format (e.g., numbers to percentages, text to categories)

  • Combining data – merging data from different sources

  • Summarizing data – calculating totals, averages, etc.

Example

Imagine you have a spreadsheet of student scores:

  • You remove incorrect entries (cleaning)

  • Sort scores from highest to lowest (sorting)

  • Calculate the average score (summarizing)

Where it’s used

Data manipulation is widely used in:

  • Data science

  • Statistics

  • Computer science

  • Business analytics and reporting

👉 In short: Data manipulation turns raw data into meaningful, usable information.

Package: dplyr

  • dplyr is used for data manipulation and summarization. It helps to:

    • Select variables

    • Filter rows

    • Create new variables

    • Arrange data

    • Summarize data easily

Install and Load dplyr package

This presentation provides an overview of performing data manipulation in R using the dplyr library. It covers key operations such as filtering, selecting specific columns, modifying variables, sorting, summarizing, chaining operations, and dataset reorganization.

Setting Up the Environment

Before performing data manipulation, ensure that you have the required libraries installed:

# install.packages("dplyr")
# Load the library
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Set a working Directory

setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")

Import dataset

gss <-read.csv("GSSsubset.csv")

Using filter(): Select rows that meet specific conditions.

# Example: Filter rows where age > 30

gss_filtered <- gss |>
  dplyr::filter(age >30)

Using select(): Choose specific columns from the dataset.

# Example: Select only 'sex' and 'income' columns

gss_selected <- gss |>
  dplyr::select(sex, income)

Using mutate(): Create new variables or modify existing ones.

# Example: Add a new column 'salary_category' based on income

gss_Mutated <- gss |>
  dplyr::mutate(salary_category = ifelse(income > 50000, "High", "Low"))

Using arrange(): Sort rows based on one or more variables.

# Example: Arrange data by descending income
gss_arranged <- gss |> 
  dplyr::arrange(desc(income))

Using summarize(): Compute summary statistics for groups.

# Example: Calculate mean income grouped by gender
gss |> 
  dplyr::group_by(sex) |>
  dplyr::summarize(mean_income = mean(income, na.rm = TRUE))
# A tibble: 2 × 2
  sex    mean_income
  <chr>        <dbl>
1 FEMALE      27300.
2 MALE        46096.

Chaining Operations with |>: Combine multiple operations into a single pipeline.

# Example: Chain multiple operations
gss_processed <- gss |>
  dplyr::filter(age > 30) |>                # Step 1: Filter rows where age > 30
  dplyr::select(degree, income, sex) |>                                      # Step 2: Select specific columns
  dplyr::mutate(income_category = ifelse(income > 50000, "High", "Low")) |>  # Step 3: Add a new column
  dplyr::arrange(desc(income))                                               # Step 4: Arrange by descending income

#print(head(gss_processed))

Reorganize the dataset

# Subsetting data

#gss[1,1]                 # first row first column
#gss[,1]                  # first column
#gss[1,]                  # first row
#gss[,1:2]                # first 2 columns

# subsetting by specific criteria
#gss$income
#gss[gss$income > 1e5,]
#gss$degree
#gss[gss$degree == "GRADUATE",]
#gss$marital
#gss[gss$marital=="DIVORCED",]

Summary report

summary(gss)
       id             sex               degree              income        
 Min.   :   1.0   Length:994         Length:994         Min.   :   369.5  
 1st Qu.: 648.2   Class :character   Class :character   1st Qu.: 15703.8  
 Median :1254.5   Mode  :character   Mode  :character   Median : 27712.5  
 Mean   :1271.2                                         Mean   : 36887.2  
 3rd Qu.:1915.8                                         3rd Qu.: 49882.5  
 Max.   :2538.0                                         Max.   :158657.0  
   marital               age            height          weight     
 Length:994         Min.   :19.00   Min.   :57.00   Min.   : 90.0  
 Class :character   1st Qu.:33.00   1st Qu.:64.00   1st Qu.:150.0  
 Mode  :character   Median :44.00   Median :67.00   Median :175.0  
                    Mean   :44.49   Mean   :67.41   Mean   :181.3  
                    3rd Qu.:55.00   3rd Qu.:70.00   3rd Qu.:205.0  
                    Max.   :79.00   Max.   :79.00   Max.   :410.0  
     hrswrk     
 Min.   : 1.00  
 1st Qu.:38.00  
 Median :40.00  
 Mean   :42.64  
 3rd Qu.:50.00  
 Max.   :89.00  
#aggregate(income ~ sex, data = gss, mean)  #mean income for each gender
#aggregate(income ~ sex, data = gss, max)   # maximum income for each gender
#aggregate(income ~ sex + degree, data = gss, mean) #mean income by gender and level of education
#aggregate(income ~ marital + sex + age, data = gss, mean)
# -----------------------------------------------
#Homework
# ----------------------------------------------
#Practice dplyr Functions :
#Use the dplyr package to manipulate a "car"dataset.
#Perform the following tasks:
#Filter rows based on specific conditions.(Wheelbase>110)
#Select specific columns.
#Create or modify variables using mutate().
#Sort rows using arrange().
#Compute summary statistics using summarize().

Session 4: Data Visualization in R with ggplot2

4.1 Data Visualization

Data visualization helps in understanding patterns, trends, and relationships in data.

It is a crucial element in scientific research, enabling researchers to interpret and communicate their results effectively

4.2 Types of Data Visualization

1. Univariate Data Visualizations (Single Variable)

✔️ Histogram: Used for understanding the distribution of a single variable.

✔️ Box Plot: Used for detecting outliers and understanding the spread of data.

2. Bivariate Data Visualizations (Two Variables)

✔️ Scatter Plot: Used for understanding relationships between two numerical variables.

✔️ Line Plot: Used for showing trends over time or continuous data.

✔️ Bar Chart: Used for comparing categorical data.

3. Multivariate Data Visualizations (More than Two Variables)

✔️ Heatmap: Used for visualizing correlations between multiple numerical variables.

✔️ Pair Plot: Used for visualizing pairwise relationships in the dataset.

✔️ Violin Plot: Used for understanding the distribution of a variable across categories.

4. Specialized Data Visualizations

✔️ Pie Chart: Used for representing proportions.

✔️ Bubble Chart: Used for adding a third variable to a scatter plot (comparing three numerical variables).

✔️ Word Cloud: Used to highlight keywords, trends, or themes in textual data (text analysis: key terms in articles, reviews, or social media posts).

5. Time Series Visualizations

✔️ Time Series Line plot: Used to analyze trends, patterns, or changes in data over a continuous period (e.g., days, months, years).

✔️ Autocorrelation Plot: Used for finding patterns in time series data.
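
As a minimal sketch of the time-series plots above, using R's built-in AirPassengers data:

# Time series line plot: monthly international airline passengers, 1949-1960
plot(AirPassengers,
     main = "Monthly Airline Passengers, 1949-1960",
     ylab = "Passengers (thousands)")

# Autocorrelation plot of the same series
acf(AirPassengers, main = "Autocorrelation of AirPassengers")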

4.3 Best Practices

✅ Choose the Right Chart Type (e.g., bar charts for categories, line charts for trends).

✅ Follow Design Principles (simplicity, consistency, accessibility).

✅ Use Storytelling to highlight key insights and structure visuals logically.

✅ Avoid Common Pitfalls (misleading scales, cluttered visuals, unnecessary 3D charts).

4.4 Data Visualization with ggplot2

What is ggplot2?

  • A powerful plotting system for R, based on the Grammar of Graphics

  • Developed by Hadley Wickham

  • Allows building complex, publication-quality graphics in layers.

  • ggplot2 is widely regarded as the gold standard for data visualization in R.

  • ggplot2 is used for drawing graphs and charts in a clear and attractive way. It helps students create: Bar charts, Histograms, Boxplots, Scatter plots, Line graphs etc.

    Here’s why:

✔️ Consistent, intuitive syntax that makes it easy to learn and use across various plot types.

✔️ Seamless integration with other tidyverse packages, enabling smooth data workflows.

✔️ Supports faceting, grouping, and mapping aesthetics.

✔️ Produces professional-quality visuals.

✔️ Efficient handling of large data sets, ensuring smooth and responsive plotting even with complex data.

✔️ Over 100 extensions that enhance its core capabilities, providing endless options for creative visualizations.

✔️ Trusted by more than 1,000 packages, ensuring reliability and broad support.

Remarks:

It breaks down the process of data visualization into layers, making it easier to customize & understand how to build effective charts. Layers are added using the ‘+’ operator.

Essential layers used to create a plot:

1️⃣ Data: The foundation, where you start by defining the data set.

2️⃣ Aesthetics: Map variables to visual aspects like color, size, and position.

3️⃣ Geometries: Specify the type of plot you want, such as bar, line, or scatter.

4️⃣ Facets: Create subplots for different subsets of your data.

5️⃣ Statistics: Add statistical transformations, like mean lines or trend lines.

6️⃣ Coordinates: Control the plot’s coordinate system, such as flipping axes.

7️⃣ Theme: Adjust the overall appearance, like grid lines, font styles, and background.

Install and Load ggplot2 package

#install.packages("ggplot2")
# Load the library
library(ggplot2)

Set a working Directory

setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")

Import the dataset

gss <-read.csv("GSSsubset.csv")

The data

All ggplot2 plots require a data frame as input.

Just running this line will produce a blank plot because we have not stated which elements from the data we want to visualize or how we want to visualize them.

# Initialize a ggplot object with data

# ggplot(gss)           # data: This will show an empty plot as no geom is added yet

gss |> ggplot() 

The aesthetics

Next, we need to specify the visual properties of the plot that are determined by the data.

The aesthetics are specified using the aes() function.

The output should now produce a blank plot but with determined visual properties (e.g., axes labels).

gss |>                                     # data
  ggplot(aes(x = age, y = income))         # aesthetics

The geometries

Finally, we need to specify the visual representation of the data. The geometries are specified using the geom_*() function.

There are many different types of geometries that can be used in ggplot2.

We will use geom_point() in this example and we will append it to the previous plot using the + operator.

The output should now produce a plot with the specified visual representation of the data.

use geom_point()

gss |>                                  # data
  ggplot(aes(x = age, y = income)) +    # aesthetics
  geom_point()                          # geometry

# change the color of point to my choice

gss |>                                  # data
  ggplot(aes(x = age, y = income)) +    # aesthetics
  geom_point(color = "Tomato")          # geometry (change the color of point to my choice)

#color point in the plot by marital

gss |>                                  # data
  ggplot(aes(x = age, y = income,colour = marital )) +    # aesthetics (color point in the plot by marital)
  geom_point()                          # geometry

# change the point size in the plot

gss |>                                  # data
  ggplot(aes(x = age, y = income,colour = marital )) +    # aesthetics (color point in the plot by marital)
  geom_point(size = 4)                          # geometry(change the point size in the plot )

Histogram: Used for understanding the distribution of a single variable.

We will use geom_histogram() to the plot using the + operator.

use geom_histogram()

gss |>                            # data
  ggplot(aes(x = income)) +       # aesthetics
  geom_histogram()                # geometries
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# change the color
gss |> 
  ggplot(aes(x = income)) +       
  geom_histogram(fill = "red")         
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# Change the color & specifies the number of bars 
gss |> 
  ggplot(aes(x = income)) +       
  geom_histogram(fill = "red", bins = 5)         

# Change the color & specifies the number of bars 
gss |> 
  ggplot(aes(x = income)) +     
  geom_histogram(fill = "red", bins = 5, color = "black")  

Labels: Adding Titles and Labels

Clear titles and labels are essential for making your plots understandable.

Labels can be added to various components of a plot using the labs() function.

# Labels: Adding Titles and Labels
gss |> 
  ggplot(aes(x = income)) +     
  geom_histogram(fill = "red", bins = 5, color = "black") +
  labs(x = "Income in Kshs",
       y = "No. of Respondents",
       title =  "Histogram showing income Distribution",
       caption = "Source: CDAM Experts, 2026") +
       theme_classic()

Bar Chart: Used for comparing categorical data.

use geom_bar()

# Create a bar plot
gss |>                           # data
  ggplot(aes(x= age)) +          # aesthetics
  geom_bar()                     # geometrics

# geometrics (the "fill" argument specifies the color of the bars)
gss |>                           
  ggplot(aes(x= age)) +          
  geom_bar(fill = "blue")                    

# Labels: Adding Titles and Labels
gss |>                           
  ggplot(aes(x= age)) +          
  geom_bar(fill = "blue") +
  labs(x = "Age in Years",
       y = "No. of Respondents",
       title =  "A Bar chart showing age Distribution",
       caption = "Source: CDAM Experts, 2026") +
       theme_classic()

Boxplot: Used for detecting outliers and understanding the spread of data.

We will use geom_boxplot() using the + operator.

use geom_boxplot()

# create boxplot chart
gss |> 
  ggplot(aes(x= degree, y = income)) +
  geom_boxplot()

# geometrics (the "fill" argument specifies the color of the boxplot)
gss |> 
  ggplot(aes(x= degree, y = income)) +
  geom_boxplot(fill = "red")

# The "fill" argument specifies the color of boxplot by sex)
gss |> 
  ggplot(aes(x= degree, y = income, fill = sex)) +
  geom_boxplot()

# adds random noise (jitter) to points in a scatter plot to reduce overplotting when many points overlap
gss |> 
  ggplot(aes(x= degree, y = income, fill = sex)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.2) # Controls transparency (opacity) of points[from 0 (completely transparent) to 1 (fully opaque)]

# Labels: Adding Titles and Labels
gss |> 
  ggplot(aes(x= degree, y = income, fill = sex)) +
  geom_boxplot() + 
  labs(x = "Education Qualification",
       y = "Income Levels",
       title =  "Boxplot showing income Distribution by Degree",
       caption = "Source: CDAM Experts, 2026") +
       theme_classic()

gss |> 
  ggplot(aes(x= degree, y = income, fill = sex)) +
  geom_boxplot() + 
  labs(x = "Education Qualification",
       y = "Income Levels",
       title =  "Boxplot showing income Distribution by Degree",
       caption = "Source: CDAM Experts, 2026") +
       theme_classic() + 
  theme(legend.position = "top")

Themes

The “theme” function is used to specify the theme of the plot.

There are many preset theme functions, and further custom themes can be created using the generic theme() function.

There are many different themes that can be used in ggplot2.

Typically you will want to set the theme at the end of your plot

Scatter Plot: Visualize the relationship between two continuous variables.

# Example: Scatter plot of 'income' vs 'age'
gss |> 
  ggplot(aes(x = age, y = income)) +
  geom_point(color = "blue") +
  labs(title = "Income vs Age", 
       x = "Age", 
       y = "Income") +
  theme_minimal()

Faceting: Creating Small Multiples

Facets are a powerful feature of ggplot2 that allow us to create multiple plots based on a single variable.

This “small multiple” approach makes it easy to compare distributions or relationships across groups.

Facets also make use of the ~ operator.

gss |> 
  ggplot(aes(x = age, y = income)) +
  geom_point(color = "blue") +
    facet_wrap(~ sex) + 
  labs(title = "Income vs Age", 
       x = "Age", 
       y = "Income") +
  theme_classic()

# Create multiple bar plots using facet_wrap()
gss |> 
  ggplot(aes(x = income)) +     
  geom_histogram(fill = "red", bins = 5, color = "black") +
   facet_wrap(~ sex) +
  labs(x = "Income in Kshs",
       y = "No. of Respondents",
       title =  "Histogram showing income Distribution",
       caption = "Source: CDAM Experts, 2026") +
       theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) # Align the title to the center

Exporting plots

We can export plots to a variety of formats using the ggsave() function.

We can specify which plot to export by saving it in an object and then passing that object to ggsave(); otherwise ggsave() saves the most recent plot.

The width and height of the output image can be set using the width and height arguments, and the resolution using the dpi argument.

The file type is inferred from the file extension, or it can be set explicitly using the device argument.

I recommend using informative names for the output file so that it is easily identifiable.

gss |> 
  ggplot(aes(x = income)) +     
  geom_histogram(fill = "blue", bins = 5, color = "black") +
   facet_wrap(~ sex) +
  labs(x = "Income in Kshs",
       y = "No. of Respondents",
       title =  "Histogram showing income Distribution",
       caption = "Source: CDAM Experts, 2026") +
       theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) # Align the title to the center

ggsave("Plot_1.png", width = 10, height = 6, dpi = 300)   # saves the last plot displayed
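
As described above, you can also store the plot in an object and pass it to ggsave() explicitly (the file name below is arbitrary):

p_hist <- gss |> 
  ggplot(aes(x = income)) +     
  geom_histogram(fill = "blue", bins = 5, color = "black")

ggsave("income_histogram.png", plot = p_hist, width = 10, height = 6, dpi = 300)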

Interactive Visualizations: Use tools like Plotly to enable user interaction.

library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
# Create a ggplot

p = gss |> 
  ggplot(aes(x= degree, y = income, fill = sex)) +
  geom_boxplot() + 
  labs(x = "Education Qualification",
       y = "Income Levels",
       title =  "Boxplot showing income Distribution by Degree",
       caption = "Source: CDAM Experts, 2026") +
       theme_classic() + 
  theme(legend.position = "top")

# Convert to interactive plot
ggplotly(p)

Violin plots

Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.

Typically, violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.

Key function:

geom_violin(): Creates violin plots. Key arguments:

fill: Areas fill color

gss |> ggplot(aes(x = age, y = income)) +
  geom_violin(fill = "tomato") +
  geom_jitter(alpha = 0.2) +
  theme_classic()

# -----------------------------------------------
#Homework
#Dataset:mockdata_cases.csv provided
# -----------------------------------------------
#Create Visualizations for a Dataset :
#Use the ggplot2 package to create visualizations for a dataset of your choice.
#Perform the following tasks:
#Create a scatter plot to visualize the relationship between two continuous variables.
#Create a bar plot to display the count or summary of a categorical variable.
#Create a histogram to show the distribution of a single variable.
#Create a boxplot to summarize the distribution of a continuous variable across categories.
#Customize your plots with titles, labels, themes, and aesthetic modifications.

Session 5: Hypothesis Testing in R

5.1 Learning Objectives

By the end of this session, learners should be able to:

  • Understand the logic and framework of hypothesis testing

  • Formulate null and alternative hypotheses correctly

  • Select appropriate statistical tests

  • Perform hypothesis tests in R

  • Interpret results in a statistically sound manner

5.2 Conceptual Foundation

5.2.1 What is Hypothesis Testing?

Hypothesis testing is a statistical inference method used to make decisions about a population parameter based on sample data.

It answers:

Is the observed effect real, or due to random chance?

5.2.2 Key Terminology

  • Null Hypothesis (H₀): Assumes no effect or no difference

  • Alternative Hypothesis (H₁): Assumes there is an effect or difference

  • Significance Level (α): Probability of rejecting H₀ when it is true (commonly 0.05)

  • p-value: Probability of observing results at least as extreme as the sample, assuming H₀ is true

  • Test Statistic: Value calculated from sample data

  • Type I Error: Rejecting a true H₀

  • Type II Error: Failing to reject a false H₀

5.2.3 General Steps in Hypothesis Testing

  1. Define hypotheses (H₀ and H₁)

  2. Choose significance level (α)

  3. Select appropriate test

  4. Compute test statistic

  5. Calculate p-value

  6. Make decision:

  • If p ≤ α → Reject H₀

  • If p > α → Fail to reject H₀

One-Sample t-Test

Test whether the mean of a sample is significantly different from a specified value.

# Example: One-sample t-test
sample_data = c(22, 24, 26, 28, 30, 32, 34, 36)
known_mean = 30

# Perform one-sample t-test
result <- t.test(sample_data, mu = known_mean)

# Print results
print(result)

    One Sample t-test

data:  sample_data
t = -0.57735, df = 7, p-value = 0.5818
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 24.90435 33.09565
sample estimates:
mean of x 
       29 

Two-Sample t-Test

Test whether the means of two independent samples are significantly different.

# Example: Two-sample t-test
group1 <- c(22, 24, 26, 28, 30)
group2 <- c(32, 34, 36, 38, 40)

# Perform two-sample t-test
result = t.test(group1, group2)

# Print results
print(result)

    Welch Two Sample t-test

data:  group1 and group2
t = -5, df = 8, p-value = 0.001053
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.612008  -5.387992
sample estimates:
mean of x mean of y 
       26        36 

Paired t-Test

Test whether the means of two related samples are significantly different.

# Example: Paired t-test
before <- c(21, 24, 26, 28, 30)
after <- c(24, 26, 27, 30, 32)

# Perform paired t-test
result <- t.test(before, after, paired = TRUE)

# Print results
print(result)

    Paired t-test

data:  before and after
t = -6.3246, df = 4, p-value = 0.003198
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.877989 -1.122011
sample estimates:
mean difference 
             -2 

Chi-Square Test for Independence

Test whether there is a significant association between two categorical variables.

# Example: Chi-square test for independence
data <- matrix(c(20, 10, 15, 25), nrow = 2, byrow = TRUE)
rownames(data) <- c("Group1", "Group2")
colnames(data) <- c("CategoryA", "CategoryB")
data
       CategoryA CategoryB
Group1        20        10
Group2        15        25
# Perform chi-square test
result <- chisq.test(data)

# Print results
print(result)

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 4.725, df = 1, p-value = 0.02973

5.3 Dataset Context: GSSsubset.csv

The dataset is a subset of the General Social Survey (GSS), containing:

  • Categorical variables: sex, degree (education level), marital status

  • Numerical variables: age, income, height, weight, hours worked per week

5.4 Loading and Inspecting Data in R

# Load required libraries 
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.4     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ plotly::filter() masks dplyr::filter(), stats::filter()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import dataset 
gss <- read.csv("GSSsubset.csv")  

# Inspect structure 
str(gss) 
'data.frame':   994 obs. of  9 variables:
 $ id     : int  1 2 4 14 16 19 21 27 28 30 ...
 $ sex    : chr  "MALE" "FEMALE" "FEMALE" "FEMALE" ...
 $ degree : chr  "BACHELOR" "BACHELOR" "BACHELOR" "HIGH SCHOOL" ...
 $ income : num  60968 60968 10161 17551 17551 ...
 $ marital: chr  "DIVORCED" "MARRIED" "MARRIED" "MARRIED" ...
 $ age    : int  53 26 56 40 56 51 30 35 57 54 ...
 $ height : int  72 60 68 65 66 68 62 70 71 71 ...
 $ weight : int  190 97 160 156 210 170 115 180 225 165 ...
 $ hrswrk : int  60 40 20 37 6 50 38 40 40 40 ...
glimpse(gss)
Rows: 994
Columns: 9
$ id      <int> 1, 2, 4, 14, 16, 19, 21, 27, 28, 30, 32, 38, 40, 44, 45, 46, 4…
$ sex     <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "FEMALE"…
$ degree  <chr> "BACHELOR", "BACHELOR", "BACHELOR", "HIGH SCHOOL", "HIGH SCHOO…
$ income  <dbl> 60967.50, 60967.50, 10161.25, 17551.25, 17551.25, 15703.75, 17…
$ marital <chr> "DIVORCED", "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRI…
$ age     <int> 53, 26, 56, 40, 56, 51, 30, 35, 57, 54, 61, 31, 35, 26, 50, 43…
$ height  <int> 72, 60, 68, 65, 66, 68, 62, 70, 71, 71, 64, 67, 69, 76, 67, 68…
$ weight  <int> 190, 97, 160, 156, 210, 170, 115, 180, 225, 165, 128, 150, 200…
$ hrswrk  <int> 60, 40, 20, 37, 6, 50, 38, 40, 40, 40, 40, 39, 50, 45, 60, 40,…
#summary(gss) 

Key Checks

  • Identify variable types (numeric, factor)

  • Check missing values

# Check missing values
colSums(is.na(gss))
     id     sex  degree  income marital     age  height  weight  hrswrk 
      0       0       0       0       0       0       0       0       0 

5.5 Hypothesis Testing Framework

5.5.1 General Structure

  • H₀ (Null): No effect / no difference

  • H₁ (Alternative): There is an effect / difference

5.5.2 Decision Rule

  • Reject H₀ if p-value ≤ 0.05

  • Otherwise, fail to reject H₀

5.6 Selecting the Right Test

  • Mean vs constant (numeric) → One-sample t-test

  • Mean difference, 2 groups (numeric + categorical) → Two-sample t-test

  • Paired observations (numeric, paired) → Paired t-test

  • Association (categorical) → Chi-square test

  • Correlation (numeric + numeric) → Correlation test

5.7 Hands-On Analysis

5.7.1 One-Sample t-Test

Used to compare sample mean with a known value.

Research Question: Is the average age different from 40?

t.test(gss$age, mu = 40)

    One Sample t-test

data:  gss$age
t = 10.781, df = 993, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 40
95 percent confidence interval:
 43.67189 45.30598
sample estimates:
mean of x 
 44.48893 

Interpretation:

  • Check p-value

  • If p < 0.05 → Mean age significantly differs from 40

5.7.2 Two-Sample t-Test

Compares means between two independent groups.

Research Question: Does income differ by gender?

t.test(income ~ sex, data = gss)

    Welch Two Sample t-test

data:  income by sex
t = -8.9504, df = 825.42, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
95 percent confidence interval:
 -22917.25 -14673.48
sample estimates:
mean in group FEMALE   mean in group MALE 
            27300.45             46095.81 

Key Output Components:

  • t statistic (here, -8.9504)

  • Mean of each group

  • Confidence interval

  • p-value

Reporting Results (Standard Format)

Example:

A two-sample t-test was conducted to compare income between gender. The results showed a statistically significant difference (t = -8.9504, df = 825.42, p-value < 2.2e-16). Therefore, we reject the null hypothesis and conclude that income differs by gender.

5.7.3 Paired t-Test

Used when observations are dependent (e.g., before vs after).

Only if dataset has repeated measures.

before <- c(120, 130, 125, 140)
after  <- c(115, 128, 120, 135)

t.test(before, after, paired = TRUE)

    Paired t-test

data:  before and after
t = 5.6667, df = 3, p-value = 0.01088
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.863165 6.636835
sample estimates:
mean difference 
           4.25 

5.7.4 Chi-Square Test (Categorical Association)

Used for categorical data (association or independence).

# Contingency table
data <- matrix(c(10, 20, 30, 40), nrow = 2)

chisq.test(data)

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 0.44643, df = 1, p-value = 0.504

Research Question: Is education level associated with marital status?

table_data <- table(gss$degree, gss$marital) 

chisq.test(table_data)
Warning in chisq.test(table_data): Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  table_data
X-squared = 41.217, df = 16, p-value = 0.0005158

Interpretation:

  • p < 0.05 → variables are dependent

5.7.5 Correlation Test

Research Question: Is age correlated with income?

cor.test(gss$age, gss$income)

    Pearson's product-moment correlation

data:  gss$age and gss$income
t = 6.8048, df = 992, p-value = 1.748e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1509841 0.2698184
sample estimates:
      cor 
0.2111815 

5.8 Assumption Checking

5.8.1 Normality (for t-tests)

shapiro.test(gss$age)

    Shapiro-Wilk normality test

data:  gss$age
W = 0.97874, p-value = 7.129e-11

Or visually:

hist(gss$age)

qqnorm(gss$age) 
qqline(gss$age)

5.8.2 Equal Variance (F-test)

var.test(income ~ sex, data = gss)

    F test to compare two variances

data:  income by sex
F = 0.34694, num df = 486, denom df = 506, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2909261 0.4138759
sample estimates:
ratio of variances 
         0.3469425 

5.8.3. Non-Parametric Alternative

wilcox.test(income ~ sex, data = gss)

    Wilcoxon rank sum test with continuity correction

data:  income by sex
W = 84528, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Practical Exercise (In-Class)

Task Set

  1. Test whether:
     • Mean age differs from 35
     • Income differs by education level

  2. Check association:
     • Gender vs marital status

  3. Compute correlation:
     • Age vs hours worked (hrswrk)

Homework Assignment

Part A: Hypothesis Development

  • Formulate 3 research questions from GSS dataset

  • Define H₀ and H₁ clearly

Part B: Implementation in R

  • Perform:

    • One t-test

    • One chi-square test

    • One correlation test

Part C: Interpretation

For each:

  • Test statistic

  • p-value

  • Decision

  • Real-world interpretation

Session 6: Correlation and Regression Analysis

6.1 Correlation analysis

Correlation measures the strength and direction of a linear relationship between two variables.

  • Range: From -1 to +1.

    • +1: Perfect positive relationship (both increase together).

    • -1: Perfect negative relationship (one increases while the other decreases).

    • 0: No linear relationship.

  • Common Measures:

    • Pearson’s correlation coefficient (r): For continuous, normally distributed data.

    • Spearman’s rank correlation: For ranked/ordinal data.

    • Kendall’s tau: For ordinal data with ties.

  • Example: Hours studied vs. exam score → r = 0.915 indicates a strong positive correlation.

Hands-On Exercises

# Example: Pearson correlation
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6))

head(data)
  x y
1 1 2
2 2 4
3 3 5
4 4 7
5 5 6
# Calculate Pearson correlation
pearson_corr <- cor(data$x, data$y, method = "pearson")
print(paste("Pearson Correlation:", pearson_corr))
[1] "Pearson Correlation: 0.904194430179465"
# Test significance of Pearson correlation
pearson_test <- cor.test(data$x, data$y, method = "pearson")
print(pearson_test)

    Pearson's product-moment correlation

data:  data$x and data$y
t = 3.6667, df = 3, p-value = 0.03508
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1084151 0.9937257
sample estimates:
      cor 
0.9041944 
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6))

# Example: Spearman correlation
spearman_corr <- cor(data$x, data$y, method = "spearman")
print(paste("Spearman Correlation:", spearman_corr))
[1] "Spearman Correlation: 0.9"
# Test significance of Spearman correlation
spearman_test <- cor.test(data$x, data$y, method = "spearman")
print(spearman_test)

    Spearman's rank correlation rho

data:  data$x and data$y
S = 2, p-value = 0.08333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho 
0.9 
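
Kendall's tau can be computed on the same toy data by changing the method argument (a minimal sketch):

# Example: Kendall correlation
kendall_corr <- cor(data$x, data$y, method = "kendall")
print(paste("Kendall Correlation:", kendall_corr))

# Test significance of Kendall correlation
cor.test(data$x, data$y, method = "kendall")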

6.2 Regression Analysis

Regression models the relationship between a dependent variable (response) and one or more independent variables (predictors).

  • Purpose: Explains and predicts how changes in X affect Y.

  • Types:

    • Simple Linear Regression: One predictor, equation form: Y=a+bX.

    • Multiple Linear Regression: Several predictors.

    • Logistic Regression: For categorical outcomes.

    • Polynomial Regression: Models nonlinear relationships.

  • Interpretation Example: Predicted exam score = 65.47 + 2.58 × (hours studied).

    • Intercept (65.47): Expected score with zero study hours.

    • Slope (2.58): Average score increase per extra study hour.
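
Simple and multiple linear regression are demonstrated in the hands-on exercises below. For a categorical (binary) outcome, a minimal logistic-regression sketch with glm() on made-up data might look like this:

# Hypothetical data: hours studied and whether the exam was passed (1) or failed (0)
study <- data.frame(
  hours  = c(1, 2, 3, 4, 5, 6, 7, 8),
  passed = c(0, 0, 0, 1, 0, 1, 1, 1)
)

logit_model <- glm(passed ~ hours, data = study, family = binomial)
summary(logit_model)

# Predicted probability of passing after 4.5 hours of study
predict(logit_model, newdata = data.frame(hours = 4.5), type = "response")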

Hands-On Exercises

Simple Linear Regression: Use lm() to fit a simple linear regression model.

# sample dataset
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6))

# Example: Simple linear regression
model <- lm(y ~ x, data = data)

# Print summary of the model
summary(model)

Call:
lm(formula = y ~ x, data = data)

Residuals:
   1    2    3    4    5 
-0.6  0.3  0.2  1.1 -1.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    1.500      0.995   1.508   0.2288  
x              1.100      0.300   3.667   0.0351 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9487 on 3 degrees of freedom
Multiple R-squared:  0.8176,    Adjusted R-squared:  0.7568 
F-statistic: 13.44 on 1 and 3 DF,  p-value: 0.03508

Extracting the Parameters from the Model

model$coefficients
(Intercept)           x 
        1.5         1.1 
summary(model)$r.square
[1] 0.8175676

Multiple Linear Regression: Extend the model to include multiple independent variables.

# Example: Multiple linear regression
data <- data.frame(
  y = c(2, 4, 5, 4, 6),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(3, 5, 7, 6, 8))

# Fit the model
model <- lm(y ~ x1 + x2, data = data)

# Print summary of the model
summary(model)

Call:
lm(formula = y ~ x1 + x2, data = data)

Residuals:
       1        2        3        4        5 
-0.06667  0.33333 -0.26667 -0.20000  0.20000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.4222     0.6748  -0.626   0.5954  
x1           -0.1778     0.2703  -0.658   0.5784  
x2            0.8889     0.2222   4.000   0.0572 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3651 on 2 degrees of freedom
Multiple R-squared:  0.9697,    Adjusted R-squared:  0.9394 
F-statistic:    32 on 2 and 2 DF,  p-value: 0.0303

Interpreting Results: Understand coefficients, R-squared, and p-values.

# Example: Interpreting regression results

summary(model)$coefficients  # View coefficients
              Estimate Std. Error    t value   Pr(>|t|)
(Intercept) -0.4222222  0.6747656 -0.6257317 0.59537818
x1          -0.1777778  0.2703450 -0.6575959 0.57836298
x2           0.8888889  0.2222222  4.0000000 0.05719096
summary(model)$r.squared     # View R-squared value
[1] 0.969697

Model Diagnostics

# Example: Residual analysis: Plot residuals to check for patterns.
par(mfrow = c(1, 2))  # Set up a 1x2 plot layout (two diagnostic plots side by side)
plot(model)           # Generate diagnostic plots

# Example: Multicollinearity (VIF): Use Variance Inflation Factor (VIF) to detect multicollinearity.
library(car)  # Load the 'car' package (install first with install.packages("car") if needed)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:purrr':

    some
The following object is masked from 'package:dplyr':

    recode
vif(model)    # Calculate VIF values
      x1       x2 
5.481481 5.481481 

Import data