Session 1: Introduction to R for Data Science
1.1 What is R?
R is an open-source programming language for statistical computing and visualization. It is widely used in data science, research, and analytics.
R is a programming language designed for:
Statistical analysis
Data visualization
Data manipulation
Reproducible research
It is widely used in:
Academic research
Data science and machine learning
Business analytics
Key Characteristics
Open-source (free to use)
Strong statistical capabilities
Thousands of packages (e.g., dplyr, ggplot2)
Supports reproducible research (via R Markdown / Quarto)
1.2 What is RStudio?
RStudio is an Integrated Development Environment (IDE) for R. It makes working with R easier and more organized.
Main Components of RStudio
When you open RStudio, you will see four panels:
- Source (Script Editor)
Write and save your code
Recommended for all serious work
- Console
Execute R commands directly
Immediate output
- Environment/History
Shows stored variables and datasets
Tracks command history
- Files/Plots/Packages/Help
Files: navigate directories
Plots: display graphs
Packages: install/load libraries
Help: documentation
1.3 Installing R
Step 1: Download R
Go to the official repository:
👉 https://cran.r-project.org/
Choose your operating system:
Windows
macOS
Linux
Step 2: Install R (Windows example)
Click Download R for Windows
Click base
Download the latest version
Run the .exe file
Follow installation prompts (default settings are fine)
Step 3: Verify Installation
Open R (or RStudio later) and run:
version
If R is installed correctly, version details will appear.
1.4 Installing RStudio
Step 1: Download RStudio
Go to:
👉 https://posit.co/download/rstudio-desktop/
Choose:
- RStudio Desktop (Free version)
Step 2: Install
Download installer
Run setup file
Follow installation steps
Step 3: Launch RStudio
Open RStudio
It should automatically detect your R installation
1.5 What is Data Science?
Data science is an interdisciplinary field that combines statistics, computational tools and domain knowledge to extract meaningful insights and knowledge from data.
1.5.1 Key Components of Data Science
Statistics and Mathematics
Foundation for inference and modeling
Examples: hypothesis testing, regression, probability
Programming and Computing
Tools to manipulate and analyze data
Common languages: R, Python
Domain Knowledge
Understanding the problem context (e.g., healthcare, finance)
Ensures results are meaningful and actionable
1.5.2 Why R for Data Science?
Strong statistical foundation
Rich ecosystem (dplyr, ggplot2, caret)
Ideal for research and modeling
1.5.3 Core Idea
At its core, data science answers questions like:
What is happening? (descriptive analysis)
Why is it happening? (diagnostic analysis)
What will happen next? (predictive modeling)
What should we do? (prescriptive analytics)
1.5.4 Data Science Workflow (Step-by-Step)
Step 1: Problem Definition
- Clearly define the research or business question
Step 2: Data Collection
- Gather data from sources (databases, APIs, surveys, sensors)
Step 3: Data Cleaning
- Handle missing values, errors, inconsistencies
Step 4: Exploratory Data Analysis (EDA)
Summarize and visualize data
Identify patterns, trends, outliers
Step 5: Modeling
Apply statistical or machine learning models
Examples: regression, classification, clustering
Step 6: Evaluation
Assess model performance
Metrics: accuracy, RMSE, AUC
Step 7: Communication
- Present findings using reports, dashboards, or visualizations
1.5.5 Types of Data Science Tasks
Classification → Predict categories (e.g., disease vs no disease)
Regression → Predict continuous values (e.g., price, temperature)
Clustering → Group similar observations
Time Series Analysis → Analyze data over time
Natural Language Processing (NLP) → Work with text data
1.5.6 Applications of Data Science
Healthcare
Disease prediction
Patient risk modeling
Finance
Fraud detection
Credit scoring
Business
Customer segmentation
Sales forecasting
Agriculture
Crop prediction
Pest/disease detection
Social Media
Sentiment analysis
Recommendation systems
1.5.7 Tools Used in Data Science
Programming: R, Python
Visualization: ggplot2, Tableau, Power BI
Databases: SQL
Machine Learning: caret, scikit-learn (in Python), XGBoost
1.5.8 Simple Example
Problem: Predict house prices
Steps:
Collect data (size, location, price)
Clean dataset
Explore relationships
Fit regression model
Evaluate model accuracy
Predict new house prices
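The steps above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data are simulated (not a real housing dataset), and the column names size_sqm and price are hypothetical.

```r
# Simulated, hypothetical data: price depends roughly linearly on size
set.seed(42)
houses <- data.frame(size_sqm = runif(100, min = 50, max = 250))
houses$price <- 20000 + 1500 * houses$size_sqm + rnorm(100, sd = 15000)

# Explore the relationship between size and price
size_price_cor <- cor(houses$size_sqm, houses$price)

# Fit a simple linear regression model
fit <- lm(price ~ size_sqm, data = houses)

# Evaluate: root mean squared error on the data used for fitting
rmse <- sqrt(mean(residuals(fit)^2))

# Predict the price of a new 120 square metre house
new_house <- data.frame(size_sqm = 120)
predicted_price <- predict(fit, newdata = new_house)
print(predicted_price)
```

In a real project the model would be evaluated on data it has not seen, not on the data used to fit it.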
1.5.9 Summary
Data science combines:
Statistics (to reason about data)
Programming (to process data)
Domain knowledge (to interpret results)
Its goal is to transform raw data into useful insights and decisions.
1.6 Getting Started
Step 1: Open RStudio
You will see the four panels described in Section 1.2.
Creating an R script
Click File → New File → R Script
The script editor opens with an empty script
Run Your First Command
In the Console:
# Example 1: Print a message
message("Hello, R!")
Hello, R!
print("I am a boy"); print("And i go to school"); print("And i love sharing")
[1] "I am a boy"
[1] "And i go to school"
[1] "And i love sharing"
Create a Script
Click File → New File → R Script
Type:
x <- 10
y <- 5
x + y
[1] 15
Step 2: Basic Commands
# Arithmetic
2 + 2 # addition
[1] 4
5 * 3 # multiplication
[1] 15
10 / 2 # division
[1] 5
sqrt(16) # square root
[1] 4
log(10) # natural logarithm (base e)
[1] 2.302585
Assigning Variables
x <- 10
y <- 5
x + y
[1] 15
Data Types (Core Structures)
Vectors
A vector is the most basic data structure in R. It is a one-dimensional collection of elements where all elements must be of the same data type.
Think of it as a single column of data.
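Because all elements must share one type, R silently converts (coerces) mixed inputs to the most general type present. A minimal sketch of that behaviour, with illustrative variable names:

```r
# Mixing numbers, text and logicals: everything becomes character
mixed <- c(1, "two", TRUE)
print(class(mixed))

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
num_lgl <- c(1, TRUE, FALSE)
print(class(num_lgl))
print(num_lgl)
```

Checking class() after building a vector is a quick way to catch accidental coercion, for example a numeric column that became text because of one stray string.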
# Create a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
print(numeric_vector)
[1] 1 2 3 4 5
# Create a character vector
character_vector <- c("apple", "banana", "cherry")
print(character_vector)
[1] "apple"  "banana" "cherry"
# Access elements of a vector
print(numeric_vector[1]) # First element
[1] 1
print(numeric_vector[2:4]) # Second to fourth elements
[1] 2 3 4
print(character_vector[2]) # Second element
[1] "banana"
# Modify a vector
numeric_vector[3] <- 10 # Change the third element to 10
print(numeric_vector)
[1]  1  2 10  4  5
# Vector operations
sum_vector <- numeric_vector + c(1, 1, 1, 1, 1) # Add 1 to each element
print(sum_vector)
[1]  2  3 11  5  6
# Vector arithmetic
vector_a <- c(1, 2, 3)
vector_b <- c(4, 5, 6)
result <- vector_a + vector_b
print(result)
[1] 5 7 9
# Logical vector
flags <- c(TRUE, FALSE, TRUE)
Matrices
A matrix in R is a two-dimensional data structure that stores data in rows and columns, where all elements must be of the same data type.
Think of it as a table of numbers used mainly for mathematical and statistical computations.
# Create a matrix
matrix_data <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(matrix_data)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# Access elements of a matrix
print(matrix_data[1, 2]) # Element in the first row, second column
[1] 3
print(matrix_data[, 3]) # Third column
[1] 5 6
print(matrix_data[2, ]) # Second row
[1] 2 4 6
# Matrix operations
matrix_transpose <- t(matrix_data) # Transpose the matrix
print(matrix_transpose)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
# Matrix multiplication
matrix_a <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
matrix_b <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2)
result <- matrix_a %*% matrix_b
print(result) # matrix() fills column by column, so matrix_a has rows (1, 3) and (2, 4)
     [,1] [,2]
[1,]   23   31
[2,]   34   46
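The result of a matrix product depends on how the matrices were filled. By default matrix() fills column by column; the byrow = TRUE argument fills row by row instead. A short sketch comparing the two (the variable names are illustrative):

```r
# Default: values fill the columns first, giving rows (1, 3) and (2, 4)
a_by_col <- matrix(c(1, 2, 3, 4), nrow = 2)

# byrow = TRUE: values fill the rows first, giving rows (1, 2) and (3, 4)
a_by_row <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
b_by_row <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)

# Product of the row-filled matrices: rows (19, 22) and (43, 50)
product_by_row <- a_by_row %*% b_by_row
print(product_by_row)
```

Users coming from tools that fill row by row may expect the second result, so it is worth printing a matrix after creating it.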
Lists
A list is a flexible data structure that can store different types of objects together in one container.
# Example 1: Create a list
my_list <- list(name = "Alice", age = 25, scores = c(85, 90, 95))
print(my_list)
$name
[1] "Alice"
$age
[1] 25
$scores
[1] 85 90 95
# Access elements of a list
print(my_list$name) # Access the "name" element
[1] "Alice"
# Modify a list
my_list$scores <- c(80, 85, 90) # Update the "scores" element
print(my_list)
$name
[1] "Alice"
$age
[1] 25
$scores
[1] 80 85 90
# Example 2: Create a list
my_list <- list(name = "John", age = 30, hobbies = c("reading", "coding"))
print(my_list)
$name
[1] "John"
$age
[1] 30
$hobbies
[1] "reading" "coding"
# Example 3: Access elements of a list
print(my_list$name) # The "name" element
[1] "John"
print(my_list$hobbies[1]) # First element of "hobbies"
[1] "reading"
Data Frames
A data frame in R is a two-dimensional data structure used to store data in tabular form, similar to a spreadsheet or SQL table.
Rows → observations (records)
Columns → variables (features)
Each column can contain different data types (numeric, character, factor, etc.), but within a column, all values must be of the same type.
# Example 1: Create a data frame
df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 35),
Salary = c(50000, 60000, 70000))
print(df)
     Name Age Salary
1   Alice  25  50000
2     Bob  30  60000
3 Charlie  35  70000
# Example 2: Access columns of a data frame
print(df$Name) # The "Name" column
[1] "Alice"   "Bob"     "Charlie"
print(df$Age[2]) # Second value of the "Age" column
[1] 30
# Example 3: Add a new column
df$City <- c("New York", "Los Angeles", "Chicago")
print(df)
     Name Age Salary        City
1   Alice  25  50000    New York
2     Bob  30  60000 Los Angeles
3 Charlie  35  70000     Chicago
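To confirm that each column holds a single type, and to pick out rows by a condition, a sketch along these lines may help. It rebuilds the same small data frame so that it runs on its own; the names col_types and older are illustrative.

```r
# Rebuild the example data frame so this block is self-contained
df <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                 Age = c(25, 30, 35),
                 Salary = c(50000, 60000, 70000),
                 stringsAsFactors = FALSE)

# One type per column: Name is character, Age and Salary are numeric
col_types <- sapply(df, class)
print(col_types)

# Keep only the rows where Age is greater than 28
older <- df[df$Age > 28, ]
print(older)
```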
Basic Functions
x <- c(1:10, 1:5, NA)
x
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5 NA
length(x) # length of x (the NA is counted)
[1] 16
max(x, na.rm = TRUE) # maximum, excluding missing values
[1] 10
min(x, na.rm = TRUE) # minimum
[1] 1
mean(x, na.rm = TRUE) # mean
[1] 4.666667
median(x, na.rm = TRUE) # median
[1] 4
sum(x, na.rm = TRUE) # sum
[1] 70
var(x, na.rm = TRUE) # variance
[1] 8.095238
sd(x, na.rm = TRUE) # standard deviation
[1] 2.845213
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  1.000   2.500   4.000   4.667   6.500  10.000       1
table(x) # Frequency counts of entries
x
 1  2  3  4  5  6  7  8  9 10
 2  2  2  2  2  1  1  1  1  1
is.na(x) # check if each element in x is missing
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE  TRUE
which(is.na(x)) # the index/location of the missing value in the vector x
[1] 16
which(x == 1) # the index/location of a particular value in the vector x
[1]  1 11
To save the script, click File → Save As
Exercise
Create a vector of 10 random numbers
Compute mean and standard deviation
set.seed(123)
x <- rnorm(10)
mean(x)
[1] 0.07462564
sd(x)
[1] 0.9537841
Installing and Loading Packages
Packages extend R functionality.
Install a package (run once)
#install.packages("ggplot2")
#install.packages("tidyverse")
Load a package (every session)
library(ggplot2)
Working Directory (Important Concept)
The working directory is where R reads and saves files.
Check current directory
#getwd()
Set directory
#setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")
(In RStudio, you can also use: Session → Set Working Directory)
Common Beginner Mistakes
Running code without saving scripts
Not setting working directory
Forgetting to load packages (library())
Confusing assignment <- with =
Core Difference
<- (Assignment Operator)
Standard and preferred for assigning values to variables
Explicitly indicates assignment
x <- 10
= (Assignment or Argument Matching)
Can assign values but is also used to pass arguments to functions
Context-dependent → can introduce ambiguity
x = 10
Why <- is Recommended
Reason: Avoid ambiguity in functions
mean(x = c(1, 2, 3))
[1] 2
Here, x = is not assignment
It is passing an argument to the function mean()
Compare:
x <- c(1, 2, 3)
mean(x)
[1] 2
Clear separation:
<- → assignment
= → function arguments
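The difference can be checked directly. In this sketch, two small helper functions (with_equals and with_arrow, both hypothetical names) each call mean() and then report whether a variable named x exists afterwards:

```r
# '=' inside a call only names the argument; no variable x is created
with_equals <- function() {
  mean(x = c(1, 2, 3))
  vars <- ls()        # names of the variables local to this function
  "x" %in% vars
}

# '<-' inside a call assigns x first, then passes its value to mean()
with_arrow <- function() {
  mean(x <- c(1, 2, 3))
  vars <- ls()
  "x" %in% vars
}

print(with_equals())  # FALSE
print(with_arrow())   # TRUE
```

Writing assignments on their own line with <- and reserving = for function arguments avoids this kind of surprise.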
# -------------------------------------------------------------
# Homework
# -------------------------------------------------------------
#1. Practice Basic Operations:
#Create a vector of numbers from 1 to 10.
#Calculate the mean, sum, and length of the vector.
#Multiply each element of the vector by 2.
#2. Data Manipulation:
#Create a matrix of size 3x3 with random numbers.
#Find the transpose of the matrix.
#Create a data frame with three columns (Name, Age, City) and add a new column for "Country".
Session 2: Data Import, Cleaning and EDA
2.1 Overview
This stage is the foundation of any data science workflow. Poor data quality leads to invalid models and misleading conclusions.
Objectives
Import data correctly
Clean and prepare datasets
Understand structure and quality
Explore patterns using statistics and visualization
2.2 Data Importation
Data importation is the first step in any data analysis workflow. In R, data can be loaded from various sources, such as CSV, Excel, SQL databases, JSON, and web pages, using base functions such as read.csv() and packages such as readxl. Proper data importation ensures that the dataset is structured correctly and is ready for further preprocessing. Common sources include:
CSV Files: The most common format for storing tabular data, where values are separated by commas.
Excel Files: Useful for structured data with multiple sheets; read with read_excel() from the readxl package.
JSON and XML Files: Common in web applications and APIs.
SQL Databases
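A quick way to practise importing without needing an external file is to write a small CSV to a temporary location and read it back with base R. This is only a sketch: the scores data frame and its columns are made up for illustration.

```r
# Hypothetical data written to a temporary CSV file
scores <- data.frame(id = 1:3, score = c(72.5, 88, 91))
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Import the CSV and inspect its structure
imported <- read.csv(csv_path)
str(imported)

# Excel files follow the same pattern with the readxl package, e.g.
# readxl::read_excel("data.xlsx")   # assumes readxl is installed and the file exists
```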
2.3 Data Cleaning
Real-world datasets often contain missing values, duplicates, and inconsistencies that must be addressed before analysis. Data cleaning ensures that the dataset is structured correctly and free from errors that might affect the accuracy of the results.
Handling Missing Data
Missing values can occur due to data entry errors, incomplete records, or system issues. There are several ways to handle missing data:
Deletion: Removing rows or columns with missing values if they are minimal.
Imputation: Replacing missing values with statistical measures such as the mean, median, or mode.
Forward or Backward Fill: Filling missing values using previous or next available values in time-series data.
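The three approaches can be compared on a small vector. This is a base R sketch on made-up readings; the forward fill is written as an explicit loop here, although dedicated packages offer ready-made functions for it.

```r
# Hypothetical time-ordered readings with gaps
readings <- c(10, NA, 14, NA, NA, 20)

# Deletion: drop the missing values
deleted <- readings[!is.na(readings)]

# Imputation: replace missing values with the mean of the observed values
imputed <- ifelse(is.na(readings), mean(readings, na.rm = TRUE), readings)

# Forward fill: carry the last observed value forward
filled <- readings
for (i in seq_along(filled)[-1]) {
  if (is.na(filled[i])) filled[i] <- filled[i - 1]
}

print(deleted)
print(imputed)
print(filled)
```

Forward fill only makes sense when the order of the observations is meaningful, as in time series; it also leaves a leading NA untouched because there is no earlier value to carry forward.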
Handling Outliers
Outliers are extreme values that differ significantly from the rest of the data and can distort analysis results. They can be detected using statistical methods such as:
Z-score method: Identifies data points that are several standard deviations away from the mean.
Interquartile Range (IQR) method: Identifies outliers based on quartiles.
Visualization methods: Box plots and scatter plots help in detecting extreme values.
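Both numeric methods can be tried on a small made-up sample. The cutoff of 2 standard deviations for the Z-score and the multiplier of 1.5 for the IQR are common conventions assumed here, not fixed rules.

```r
# Hypothetical sample with one extreme value
values <- c(12, 14, 15, 13, 16, 14, 15, 13, 14, 95)

# Z-score method: distance from the mean in standard deviations
z_scores <- (values - mean(values)) / sd(values)
z_outliers <- values[abs(z_scores) > 2]

# IQR method: points beyond 1.5 * IQR from the quartiles
q1 <- quantile(values, 0.25)
q3 <- quantile(values, 0.75)
iqr_value <- q3 - q1
iqr_outliers <- values[values < q1 - 1.5 * iqr_value |
                       values > q3 + 1.5 * iqr_value]

print(z_outliers)
print(iqr_outliers)
```

In very small samples a single extreme value inflates the standard deviation, so Z-scores stay modest; this is why a cutoff of 2 is used in this sketch and why the IQR method is often preferred for skewed data.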
2.4 Summary Statistics
Summary statistics provide insights into the distribution, central tendency, and spread of the data. Some of the key statistical measures include:
Mean: The average value of a dataset.
Median: The middle value when the data is sorted.
Mode: The most frequently occurring value.
Variance and Standard Deviation: Measures the spread of data around the mean.
Skewness and Kurtosis: Used to understand the shape of the distribution.
These statistics help analysts understand the nature of the dataset and whether further transformations are necessary before modeling.
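Base R has no built-in function for the statistical mode, skewness, or kurtosis, so a sketch using simple moment-based formulas is given below. These are population-style definitions chosen for illustration; add-on packages may use slightly different estimators.

```r
# Hypothetical sample with a long right tail
x <- c(2, 3, 3, 4, 5, 5, 5, 9)

# Mode: the most frequent value, found from a frequency table
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

# Moment-based skewness and kurtosis
m <- mean(x)
s_pop <- sqrt(mean((x - m)^2))           # population standard deviation
skewness <- mean((x - m)^3) / s_pop^3    # > 0 means a longer right tail
kurtosis <- mean((x - m)^4) / s_pop^4    # about 3 for a normal distribution

print(mode_value)
print(skewness)
print(kurtosis)
```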
2.5 Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their key characteristics, often with the help of visualizations. It helps to identify patterns, trends, and potential issues within the data.
Data Visualization
Visualization is a key part of EDA, as it provides an intuitive understanding of data relationships. Common types of visualizations include:
Histograms: Show the distribution of numerical data.
Box Plots: Identify outliers and the spread of data.
Scatter Plots: Show relationships between two numerical variables.
Bar Charts: Compare categorical data.
Heatmaps: Display correlations between multiple numerical variables.
Using base R graphics and ggplot2, analysts can create these visualizations to better understand the data before applying statistical or machine learning models.
2.6 Importance of EDA
EDA is a critical step before building models, as it helps in:
Understanding the data structure and identifying inconsistencies.
Detecting missing values, outliers, and unusual patterns.
Selecting appropriate features for predictive modeling.
Summarizing key characteristics of a dataset.
Improving data preprocessing and transformation steps.
NB: By the end of this session, learners will be able to import datasets, clean data by handling missing values and outliers, compute summary statistics, and create visualizations for exploratory data analysis.
2.7 Getting started
Set directory
setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")
Data importation
# Example 1: Importing an Excel file
# library(readxl)
# data_excel <- read_excel("data.xlsx")
# print(head(data_excel))
# Example 2: Importing a CSV file
gss <- read.csv("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING/GSSsubset.csv")
Examine the dataset
#View(gss) # Note: Upper case V in View()
#print(head(gss)) # return the first parts of the dataset
#dim(gss) # 1st: row number; 2nd: column number
#nrow(gss) # number of rows
#ncol(gss) # number of columns
#colnames(gss) # name of columns (the variable names) in dataset
#rownames(gss) # row names
Handling Missing Data
# Example 1: Identify missing values
missing_values <- sum(is.na(gss))
print(missing_values)
[1] 0
# Example 2: Remove rows with missing values
clean_data <- na.omit(gss)
print(head(clean_data))
  id    sex         degree   income  marital age height weight hrswrk
1  1   MALE       BACHELOR 60967.50 DIVORCED  53     72    190     60
2  2 FEMALE       BACHELOR 60967.50  MARRIED  26     60     97     40
3  4 FEMALE       BACHELOR 10161.25  MARRIED  56     68    160     20
4 14 FEMALE    HIGH SCHOOL 17551.25  MARRIED  40     65    156     37
5 16   MALE    HIGH SCHOOL 17551.25  MARRIED  56     66    210      6
6 19   MALE LT HIGH SCHOOL 15703.75  MARRIED  51     68    170     50
# Example 3: Impute missing values with the mean
gss$age[is.na(gss$age)] <- mean(gss$age, na.rm = TRUE)
#print(head(gss))
Handling Outliers
# Example 1: Detect outliers using boxplot
boxplot(gss$income, main = "Income Distribution")
# Example 2: Remove outliers using IQR
Q1 <- quantile(gss$income, 0.25)
Q3 <- quantile(gss$income, 0.75)
iqr_value <- Q3 - Q1 # named iqr_value to avoid shadowing the built-in IQR() function
lower_bound <- Q1 - 1.5 * iqr_value # 1.5 is the conventional multiplier
upper_bound <- Q3 + 1.5 * iqr_value
data_cleaned <- gss[gss$income >= lower_bound & gss$income <= upper_bound, ]
#print(head(data_cleaned))
Summary Statistics
# Example 1: Summary statistics
summary(gss)
       id             sex               degree              income
 Min.   :   1.0   Length:994         Length:994         Min.   :   369.5
 1st Qu.: 648.2   Class :character   Class :character   1st Qu.: 15703.8
 Median :1254.5   Mode  :character   Mode  :character   Median : 27712.5
 Mean   :1271.2                                         Mean   : 36887.2
 3rd Qu.:1915.8                                         3rd Qu.: 49882.5
 Max.   :2538.0                                         Max.   :158657.0
   marital               age            height          weight
 Length:994         Min.   :19.00   Min.   :57.00   Min.   : 90.0
 Class :character   1st Qu.:33.00   1st Qu.:64.00   1st Qu.:150.0
 Mode  :character   Median :44.00   Median :67.00   Median :175.0
                    Mean   :44.49   Mean   :67.41   Mean   :181.3
                    3rd Qu.:55.00   3rd Qu.:70.00   3rd Qu.:205.0
                    Max.   :79.00   Max.   :79.00   Max.   :410.0
     hrswrk
 Min.   : 1.00
 1st Qu.:38.00
 Median :40.00
 Mean   :42.64
 3rd Qu.:50.00
 Max.   :89.00
# Example 2: Mean, Median, Variance, Standard Deviation
mean_value <- mean(gss$income, na.rm = TRUE)
median_value <- median(gss$income, na.rm = TRUE)
variance_value <- var(gss$income, na.rm = TRUE)
sd_value <- sd(gss$income, na.rm = TRUE)
print(paste("Mean:", mean_value))
[1] "Mean: 36887.2183521127"
print(paste("Median:", median_value))
[1] "Median: 27712.5"
print(paste("Variance:", variance_value))
[1] "Variance: 1204246576.23372"
print(paste("Standard Deviation:", sd_value))
[1] "Standard Deviation: 34702.2560683555"
Data Visualization for EDA
# Example 1: Histogram using base R
hist(gss$income,
main = "Income Distribution",
xlab = "Income",
col = "tomato")
# Example 2: Boxplot using base R
boxplot(gss$income,
main = "Income Distribution",
ylab = "Income",
col = "green")
# Example 3: Scatterplot using ggplot2
library(ggplot2)
ggplot(gss, aes(x = age, y = income)) +
geom_point(color = "red") +
labs(title = "Age vs Income", x = "Age", y = "Income") +
theme_minimal()
# ---------------------------------------
#Homework
# ---------------------------------------
#Import a Dataset :
#Download a dataset (e.g., from Kaggle or UCI Machine Learning Repository ).
#Import the dataset into R using read.csv() or read_excel().
#Clean the Dataset :
#Handle missing values by either removing them or imputing with the mean/median.
#Detect and treat outliers using the IQR method.
#Perform Exploratory Data Analysis (EDA) :
#Calculate summary statistics (mean, median, variance, standard deviation).
#Create visualizations (histograms, boxplots, scatterplots) to explore relationships in the data.
Session 3: Data Manipulation with dplyr
Data manipulation in R refers to the process of cleaning, transforming, organizing, and preparing data so it can be analyzed or visualized effectively.
In simple terms, it means taking raw data and modifying it to get useful information.
Key ideas in data manipulation
Cleaning data – removing errors, duplicates, or missing values
Sorting & filtering – arranging data or selecting only what you need
Transforming data – changing its format (e.g., numbers to percentages, text to categories)
Combining data – merging data from different sources
Summarizing data – calculating totals, averages, etc.
Example
Imagine you have a spreadsheet of student scores:
You remove incorrect entries (cleaning)
Sort scores from highest to lowest (sorting)
Calculate the average score (summarizing)
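The student-score example can be sketched in base R before moving to dplyr. The names and scores are made up, and the assumption here is that valid scores lie between 0 and 100.

```r
# Hypothetical raw scores with one invalid entry and one duplicate row
scores <- data.frame(student = c("Amina", "Brian", "Chen", "Dede", "Brian"),
                     score = c(78, 92, -5, 85, 92),
                     stringsAsFactors = FALSE)

# Cleaning: keep valid scores (0 to 100) and drop duplicate rows
cleaned <- unique(scores[scores$score >= 0 & scores$score <= 100, ])

# Sorting: highest score first
sorted <- cleaned[order(-cleaned$score), ]

# Summarizing: average score
average_score <- mean(sorted$score)

print(sorted)
print(average_score)
```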
Where it’s used
Data manipulation is widely used in:
Data science
Statistics
Computer science
Business analytics and reporting
👉 In short: Data manipulation turns raw data into meaningful, usable information.
Package: dplyr
dplyr is used for data manipulation and summarization. It helps to:
Select variables
Filter rows
Create new variables
Arrange data
Summarize data easily
Install and Load dplyr package
This session provides an overview of data manipulation in R using the dplyr library. It covers key operations such as filtering, selecting specific columns, modifying variables, sorting, summarizing, chaining operations, and dataset reorganization.
Setting Up the Environment
Before performing data manipulation, ensure that you have the required libraries installed:
# install.packages("dplyr")
# Load the library
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Set a working Directory
setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")
Import dataset
gss <- read.csv("GSSsubset.csv")
Using filter()
Select rows that meet specific conditions.
# Example: Filter rows where age > 30
gss_filtered <- gss |>
dplyr::filter(age > 30)
Using select()
Choose specific columns from the dataset.
# Example: Select only 'sex' and 'income' columns
gss_selected <- gss |>
dplyr::select(sex, income)
Using mutate()
Create new variables or modify existing ones.
# Example: Add a new column 'salary_category' based on income
gss_Mutated <- gss |>
dplyr::mutate(salary_category = ifelse(income > 50000, "High", "Low"))
Using arrange()
Sort rows based on one or more variables.
# Example: Arrange data by descending income
gss_arranged <- gss |>
dplyr::arrange(desc(income))
Using summarize()
Compute summary statistics for groups.
# Example: Calculate mean income grouped by gender
gss |>
dplyr::group_by(sex) |>
dplyr::summarize(mean_income = mean(income, na.rm = TRUE))
# A tibble: 2 × 2
  sex    mean_income
  <chr>        <dbl>
1 FEMALE      27300.
2 MALE        46096.
Chaining Operations with |>
Combine multiple operations into a single pipeline.
# Example: Chain multiple operations
gss_processed <- gss |>
dplyr::filter(age > 30) |> # Step 1: Filter rows where age > 30
dplyr::select(degree, income, sex) |> # Step 2: Select specific columns
dplyr::mutate(income_category = ifelse(income > 50000, "High", "Low")) |> # Step 3: Add a new column
dplyr::arrange(desc(income)) # Step 4: Arrange by descending income
#print(head(gss_processed))
Reorganize the dataset
# Subsetting data
#gss[1,1] # first row first column
#gss[,1] # first column
#gss[1,] # first row
#gss[,1:2] # first 2 columns
# subsetting by specific criteria
#gss$income
#gss[gss$income > 1e5,]
#gss$degree
#gss[gss$degree == "GRADUATE",]
#gss$marital
#gss[gss$marital=="DIVORCED",]
Summary report
summary(gss)
       id             sex               degree              income
 Min.   :   1.0   Length:994         Length:994         Min.   :   369.5
 1st Qu.: 648.2   Class :character   Class :character   1st Qu.: 15703.8
 Median :1254.5   Mode  :character   Mode  :character   Median : 27712.5
 Mean   :1271.2                                         Mean   : 36887.2
 3rd Qu.:1915.8                                         3rd Qu.: 49882.5
 Max.   :2538.0                                         Max.   :158657.0
   marital               age            height          weight
 Length:994         Min.   :19.00   Min.   :57.00   Min.   : 90.0
 Class :character   1st Qu.:33.00   1st Qu.:64.00   1st Qu.:150.0
 Mode  :character   Median :44.00   Median :67.00   Median :175.0
                    Mean   :44.49   Mean   :67.41   Mean   :181.3
                    3rd Qu.:55.00   3rd Qu.:70.00   3rd Qu.:205.0
                    Max.   :79.00   Max.   :79.00   Max.   :410.0
     hrswrk
 Min.   : 1.00
 1st Qu.:38.00
 Median :40.00
 Mean   :42.64
 3rd Qu.:50.00
 Max.   :89.00
#aggregate(income ~ sex, data = gss, mean) #mean income for each gender
#aggregate(income ~ sex, data = gss, max) # maximum income for each gender
#aggregate(income ~ sex + degree, data = gss, mean) #mean income by gender and level of education
#aggregate(income ~ marital + sex + age, data = gss, mean)
# -----------------------------------------------
#Homework
# ----------------------------------------------
#Practice dplyr Functions :
#Use the dplyr package to manipulate a "car" dataset.
#Perform the following tasks:
#Filter rows based on specific conditions (e.g., Wheelbase > 110).
#Select specific columns.
#Create or modify variables using mutate().
#Sort rows using arrange().
#Compute summary statistics using summarize().
Session 4: Data Visualization in R with ggplot2
4.1 Data Visualization
Data visualization helps in understanding patterns, trends, and relationships in data.
It is a crucial element in scientific research, enabling researchers to interpret and communicate their results effectively.
4.2 Types of Data Visualization
1. Univariate Data Visualizations (Single Variable)
✔️ Histogram: Used for understanding the distribution of a single variable.
✔️ Box Plot: Used for detecting outliers and understanding the spread of data.
2. Bivariate Data Visualizations (Two Variables)
✔️ Scatter Plot: Used for understanding relationships between two numerical variables.
✔️ Line Plot: Used for showing trends over time or continuous data.
✔️ Bar Chart: Used for comparing categorical data.
3. Multivariate Data Visualizations (More than Two Variables)
✔️ Heatmap: Used for visualizing correlations between multiple numerical variables.
✔️ Pair Plot: Used for visualizing pairwise relationships in the dataset.
✔️ Violin Plot: Used for understanding the distribution of a variable across categories.
4. Specialized Data Visualizations
✔️ Pie Chart: Used for representing proportions.
✔️ Bubble Chart: Used for adding a third variable to a scatter plot (comparing three numerical variables).
✔️ Word Cloud: Used for highlighting keywords, trends, or themes in textual data, such as key terms in articles, reviews, or social media posts.
5. Time Series Visualizations
✔️ Time Series Line plot: Used to analyze trends, patterns, or changes in data over a continuous period (e.g., days, months, years).
✔️ Autocorrelation Plot: Used for finding patterns in time series data.
4.3 Best Practices
✅ Choose the Right Chart Type (e.g., bar charts for categories, line charts for trends).
✅ Follow Design Principles (simplicity, consistency, accessibility).
✅ Use Storytelling to highlight key insights and structure visuals logically.
✅ Avoid Common Pitfalls (misleading scales, cluttered visuals, unnecessary 3D charts).
4.4 Data Visualization with ggplot2
What is ggplot2?
A powerful plotting system for R, based on the Grammar of Graphics
Developed by Hadley Wickham
Allows building complex, publication-quality graphics in layers.
ggplot2 is the gold standard when it comes to data visualization.
ggplot2 is used for drawing graphs and charts in a clear and attractive way. It helps students create bar charts, histograms, boxplots, scatter plots, line graphs, etc.
Here’s why:
✔️ Consistent, intuitive syntax that makes it easy to learn and use across various plot types.
✔️ Seamless integration with other tidyverse packages, enabling smooth data workflows.
✔️ Supports faceting, grouping, and mapping aesthetics.
✔️ Produces professional-quality visuals.
✔️ Efficient handling of large data sets, ensuring smooth and responsive plotting even with complex data.
✔️ Over 100 extensions that enhance its core capabilities, providing endless options for creative visualizations.
✔️ Trusted by more than 1,000 packages, ensuring reliability and broad support.
Remarks:
It breaks down the process of data visualization into layers, making it easier to customize & understand how to build effective charts. Layers are added using the ‘+’ operator.
Essential layers used to create a plot:
1️⃣ Data: The foundation, where you start by defining the data set.
2️⃣ Aesthetics: Map variables to visual aspects like color, size, and position.
3️⃣ Geometries: Specify the type of plot you want, such as bar, line, or scatter.
4️⃣ Facets: Create subplots for different subsets of your data.
5️⃣ Statistics: Add statistical transformations, like mean lines or trend lines.
6️⃣ Coordinates: Control the plot’s coordinate system, such as flipping axes.
7️⃣ Theme: Adjust the overall appearance, like grid lines, font styles, and background.
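The seven layers can be seen together in one plot. The sketch below uses R's built-in mtcars dataset (an assumption made so the block runs without the course data) and tags each line with the layer it represents.

```r
library(ggplot2)

# Each line adds one layer of the grammar of graphics
layered_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +         # data + aesthetics
  geom_point() +                                               # geometry
  facet_wrap(~ am) +                                           # facets (one panel per transmission type)
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # statistics (linear trend line)
  coord_cartesian(ylim = c(10, 35)) +                          # coordinates
  theme_minimal()                                              # theme

print(layered_plot)
```

Removing any line after the first still leaves a valid plot, which is a convenient way to see what each layer contributes.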
Install and Load ggplot2 package
#install.packages("ggplot2")
# Load the library
library(ggplot2)
Set a working Directory
setwd("C:/Users/Admin/Documents/DAM/CDAM/2025/R_TRAINING")
Import the dataset
gss <- read.csv("GSSsubset.csv")
The data
All ggplot2 plots require a data frame as input.
Just running this line will produce a blank plot because we have not stated which elements from the data we want to visualize or how we want to visualize them.
# Initialize a ggplot object with data
# ggplot(gss) # data: This will show an empty plot as no geom is added yet
gss |> ggplot()
The aesthetics
Next, we need to specify the visual properties of the plot that are determined by the data.
The aesthetics are specified using the aes() function.
The output should now produce a blank plot but with determined visual properties (e.g., axes labels).
gss |> # data
ggplot(aes(x = age, y = income)) # aesthetics
The geometries
Finally, we need to specify the visual representation of the data. The geometries are specified using the geom_*() function.
There are many different types of geometries that can be used in ggplot2.
We will use geom_point() in this example and we will append it to the previous plot using the + operator.
The output should now produce a plot with the specified visual representation of the data.
Use geom_point()
gss |> # data
ggplot(aes(x = age, y = income)) + # aesthetics
geom_point() # geometry
# Change the color of the points
gss |> # data
ggplot(aes(x = age, y = income)) + # aesthetics
geom_point(color = "tomato") # geometry (all points in one chosen color)
# Color the points by marital status
gss |> # data
ggplot(aes(x = age, y = income, colour = marital)) + # aesthetics (color mapped to marital)
geom_point() # geometry
# Change the point size
gss |> # data
ggplot(aes(x = age, y = income, colour = marital)) + # aesthetics (color mapped to marital)
geom_point(size = 4) # geometry (larger points)
Histogram: Used for understanding the distribution of a single variable.
We will add geom_histogram() to the plot using the + operator.
Use geom_histogram()
gss |> # data
ggplot(aes(x = income)) + # aesthetics
geom_histogram() # geometry
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# Change the fill color
gss |>
ggplot(aes(x = income)) +
geom_histogram(fill = "red")
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# Change the fill color and specify the number of bins
gss |>
ggplot(aes(x = income)) +
geom_histogram(fill = "red", bins = 5)
# Add a black outline to the bars
gss |>
ggplot(aes(x = income)) +
geom_histogram(fill = "red", bins = 5, color = "black")
Labels: Adding Titles and Labels
Clear titles and labels are essential for making your plots understandable.
Labels can be added to various components of a plot using the labs() function.
# Labels: Adding Titles and Labels
gss |>
ggplot(aes(x = income)) +
geom_histogram(fill = "red", bins = 5, color = "black") +
labs(x = "Income in Kshs",
y = "No. of Respondents",
title = "Histogram showing income Distribution",
caption = "Source: CDAM Experts, 2026") +
theme_classic()
Bar Chart: Used for comparing categorical data.
use geom_bar()
# Create a bar plot
gss |> # data
ggplot(aes(x = age)) + # aesthetics
geom_bar() # geometry
# the "fill" argument specifies the color of the bars
gss |>
ggplot(aes(x = age)) +
geom_bar(fill = "blue")
# Labels: Adding Titles and Labels
gss |>
ggplot(aes(x= age)) +
geom_bar(fill = "blue") +
labs(x = "Age in Years",
y = "No. of Respondents",
title = "A Bar chart showing age Distribution",
caption = "Source: CDAM Experts, 2026") +
theme_classic()
Boxplot: Used for detecting outliers and understanding the spread of data.
We will add geom_boxplot() to the plot using the + operator.
use geom_boxplot()
# create boxplot chart
gss |>
ggplot(aes(x= degree, y = income)) +
geom_boxplot() # geometry
# the "fill" argument specifies the color of the boxplot
gss |>
ggplot(aes(x = degree, y = income)) +
geom_boxplot(fill = "red")
# mapping "fill" to sex inside aes() colors the boxplots by sex
gss |>
ggplot(aes(x = degree, y = income, fill = sex)) +
geom_boxplot()
# geom_jitter() adds random noise (jitter) to the points to reduce overplotting when many points overlap
gss |>
ggplot(aes(x = degree, y = income, fill = sex)) +
geom_boxplot() +
geom_jitter(alpha = 0.2) # alpha controls the transparency of the points, from 0 (completely transparent) to 1 (fully opaque)
# Labels: Adding Titles and Labels
gss |>
ggplot(aes(x= degree, y = income, fill = sex)) +
geom_boxplot() +
labs(x = "Education Qualification",
y = "Income Levels",
title = "Boxplot showing income Distribution by Degree",
caption = "Source: CDAM Experts, 2026") +
theme_classic()
# Move the legend to the top of the plot
gss |>
ggplot(aes(x= degree, y = income, fill = sex)) +
geom_boxplot() +
labs(x = "Education Qualification",
y = "Income Levels",
title = "Boxplot showing income Distribution by Degree",
caption = "Source: CDAM Experts, 2026") +
theme_classic() +
theme(legend.position = "top")
Themes
The theme_*() functions control the overall, non-data appearance of a plot.
ggplot2 ships with many preset themes (e.g., theme_classic() and theme_minimal()), and further customization is possible with the generic theme() function.
Typically, you will want to set the theme at the end of your plot code.
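As a minimal sketch of how a preset theme and theme() adjustments combine, the example below assumes ggplot2 is installed and uses the built-in mtcars data purely as a stand-in for gss. The object names base_plot and styled_plot are hypothetical.

```r
library(ggplot2)

# Base scatter plot on the built-in mtcars data (illustrative stand-in for gss)
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel efficiency vs weight")

# Apply a preset theme first, then fine-tune individual elements with theme().
# Order matters: a preset theme added after theme() would overwrite the tweaks.
styled_plot <- base_plot +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

styled_plot
```

Placing theme_minimal() before theme() is the design choice to note here: preset themes are complete themes, so they reset any element set earlier in the chain.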
Scatter Plot: Visualize the relationship between two continuous variables.
# Example: Scatter plot of 'income' vs 'age'
gss |>
ggplot(aes(x = age, y = income)) +
geom_point(color = "blue") +
labs(title = "Income vs Age",
x = "Age",
y = "Income") +
theme_minimal()
Faceting: Creating Small Multiples
Facets are a powerful feature of ggplot2 that allow us to create multiple plots based on a single variable.
This “small multiples” approach makes it easy to compare distributions or relationships across groups.
Facets also make use of the ~ operator.
gss |>
ggplot(aes(x = age, y = income)) +
geom_point(color = "blue") +
facet_wrap(~ sex) +
labs(title = "Income vs Age",
x = "Age",
y = "Income") +
theme_classic()
# Create multiple histograms using facet_wrap()
gss |>
ggplot(aes(x = income)) +
geom_histogram(fill = "red", bins = 5, color = "black") +
facet_wrap(~ sex) +
labs(x = "Income in Kshs",
y = "No. of Respondents",
title = "Histogram showing income Distribution",
caption = "Source: CDAM Experts, 2026") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) # Align the title to the center
Exporting plots
We can export plots to a variety of formats using the ggsave() function.
We can specify which plot to export by saving it in an object and passing that object to ggsave(); otherwise, ggsave() saves the last plot displayed.
The size of the output image can be set using the width and height arguments, and its resolution using the dpi argument.
The file type can be set using the device argument, or inferred from the file extension.
I recommend using informative names for the output file so that it is easily identifiable.
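The sketch below illustrates saving a stored plot object rather than the last plot displayed. It assumes ggplot2 is installed, uses the built-in mtcars data as a stand-in, and writes to a temporary folder. The object name wt_plot and the file name are hypothetical.

```r
library(ggplot2)

# Store the plot in an object instead of only printing it
wt_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# tempdir() keeps this sketch from writing into the working directory
out_file <- file.path(tempdir(), "mpg_vs_weight.png")

# Passing the object via `plot` saves that plot rather than the last one shown.
# width and height are in inches by default; dpi sets the resolution.
ggsave(out_file, plot = wt_plot, width = 10, height = 6, dpi = 300)

file.exists(out_file)
```

Naming the plot explicitly avoids accidentally exporting a different plot that happened to be drawn in between.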
gss |>
ggplot(aes(x = income)) +
geom_histogram(fill = "blue", bins = 5, color = "black") +
facet_wrap(~ sex) +
labs(x = "Income in Kshs",
y = "No. of Respondents",
title = "Histogram showing income Distribution",
caption = "Source: CDAM Experts, 2026") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) # Align the title to the center
# Save the last plot displayed as a PNG file
ggsave("Plot_1.png", width = 10, height = 6, dpi = 300)
Interactive Visualizations: Use tools like Plotly to enable user interaction.
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Create a ggplot
p = gss |>
ggplot(aes(x= degree, y = income, fill = sex)) +
geom_boxplot() +
labs(x = "Education Qualification",
y = "Income Levels",
title = "Boxplot showing income Distribution by Degree",
caption = "Source: CDAM Experts, 2026") +
theme_classic() +
theme(legend.position = "top")
# Convert to interactive plot
ggplotly(p)
Violin plots
Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values.
Typically, violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.
Key function:
geom_violin(): Creates violin plots.
Key arguments:
fill: The fill color of the violin area
gss |> ggplot(aes(x = age, y = income)) +
geom_violin(fill = "tomato") +
geom_jitter(alpha = 0.2) +
theme_classic()
# -----------------------------------------------
# Homework
# Dataset: mockdata_cases.csv (provided)
# -----------------------------------------------
# Create visualizations for a dataset:
# Use the ggplot2 package to create visualizations for a dataset of your choice.
# Perform the following tasks:
# Create a scatter plot to visualize the relationship between two continuous variables.
# Create a bar plot to display the count or summary of a categorical variable.
# Create a histogram to show the distribution of a single variable.
# Create a boxplot to summarize the distribution of a continuous variable across categories.
# Customize your plots with titles, labels, themes, and aesthetic modifications.
Session 5: Hypothesis Testing in R
5.1 Learning Objectives
By the end of this session, learners should be able to:
Understand the logic and framework of hypothesis testing
Formulate null and alternative hypotheses correctly
Select appropriate statistical tests
Perform hypothesis tests in R
Interpret results in a statistically sound manner
5.2 Conceptual Foundation
5.2.1 What is Hypothesis Testing?
Hypothesis testing is a statistical inference method used to make decisions about a population parameter based on sample data.
It answers:
Is the observed effect real, or due to random chance?
5.2.2 Key Terminology
| Term | Description |
|---|---|
| Null Hypothesis (H₀) | Assumes no effect or no difference |
| Alternative Hypothesis (H₁) | Assumes there is an effect or difference |
| Significance Level (α) | Probability of rejecting H₀ when it is true (commonly 0.05) |
| p-value | Probability of observing results at least as extreme as those in the sample, assuming H₀ is true |
| Test Statistic | Value calculated from sample data |
| Type I Error | Rejecting a true H₀ |
| Type II Error | Failing to reject a false H₀ |
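To make the link between the significance level and the Type I error concrete, here is a small simulation sketch. The population values (mean 50, standard deviation 10, samples of size 30) and the number of simulations are assumptions chosen only for illustration.

```r
set.seed(123)   # make the simulation reproducible
alpha  <- 0.05
n_sims <- 2000

# Every sample is drawn from a population whose true mean IS 50,
# so H0 (mu = 50) is true and each rejection is a Type I error
p_values <- replicate(n_sims, {
  sim_sample <- rnorm(30, mean = 50, sd = 10)
  t.test(sim_sample, mu = 50)$p.value
})

# Share of simulations that (wrongly) rejected H0
type1_rate <- mean(p_values <= alpha)
type1_rate      # expected to be close to alpha
```

Because H₀ is true in every simulated sample, the rejection rate should settle near α, which is exactly what the definition of the significance level in the table says.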
5.2.3 General Steps in Hypothesis Testing
Define hypotheses (H₀ and H₁)
Choose significance level (α)
Select appropriate test
Compute test statistic
Calculate p-value
Make decision:
- If p ≤ α → Reject H₀
- If p > α → Fail to reject H₀
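The steps above can be sketched in a few lines of R. The scores vector and the hypothesized mean of 50 are hypothetical values used only to illustrate the workflow.

```r
# Step 1: hypotheses. H0: mu = 50, H1: mu != 50 (hypothetical example)
scores <- c(52, 55, 49, 61, 58, 54, 57, 60)

# Step 2: significance level
alpha <- 0.05

# Steps 3-5: a one-sample t-test computes the test statistic and the p-value
test_result <- t.test(scores, mu = 50)
p_value <- test_result$p.value

# Step 6: decision rule
decision <- if (p_value <= alpha) "Reject H0" else "Fail to reject H0"
decision
```

Storing the test in an object makes the p-value available for the decision rule instead of reading it off the printed output.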
One-Sample t-Test
Test whether the mean of a sample is significantly different from a specified value.
# Example: One-sample t-test
sample_data = c(22, 24, 26, 28, 30, 32, 34, 36)
known_mean = 30
# Perform one-sample t-test
result <- t.test(sample_data, mu = known_mean)
# Print results
print(result)
One Sample t-test
data: sample_data
t = -0.57735, df = 7, p-value = 0.5818
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
24.90435 33.09565
sample estimates:
mean of x
29
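As a check on where these numbers come from, the test statistic can be reproduced by hand from the formula t = (sample mean − μ) / (s / √n). This is a sketch using the same sample data as above; the names t_manual and p_manual are introduced here for illustration.

```r
sample_data <- c(22, 24, 26, 28, 30, 32, 34, 36)
known_mean  <- 30

n         <- length(sample_data)
std_error <- sd(sample_data) / sqrt(n)   # standard error of the mean

# Test statistic: distance between the sample mean and mu, in standard errors
t_manual <- (mean(sample_data) - known_mean) / std_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

round(c(t = t_manual, p = p_manual), 4)
```

The values should agree with the t and p-value reported by t.test() above.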
Two-Sample t-Test
Test whether the means of two independent samples are significantly different.
# Example: Two-sample t-test
group1 <- c(22, 24, 26, 28, 30)
group2 <- c(32, 34, 36, 38, 40)
# Perform two-sample t-test
result = t.test(group1, group2)
# Print results
print(result)
Welch Two Sample t-test
data: group1 and group2
t = -5, df = 8, p-value = 0.001053
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-14.612008 -5.387992
sample estimates:
mean of x mean of y
26 36
Paired t-Test
Test whether the means of two related samples are significantly different.
# Example: Paired t-test
before <- c(21, 24, 26, 28, 30)
after <- c(24, 26, 27, 30, 32)
# Perform paired t-test
result <- t.test(before, after, paired = TRUE)
# Print results
print(result)
Paired t-test
data: before and after
t = -6.3246, df = 4, p-value = 0.003198
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-2.877989 -1.122011
sample estimates:
mean difference
-2
Chi-Square Test for Independence
Test whether there is a significant association between two categorical variables.
# Example: Chi-square test for independence
data <- matrix(c(20, 10, 15, 25), nrow = 2, byrow = TRUE)
rownames(data) <- c("Group1", "Group2")
colnames(data) <- c("CategoryA", "CategoryB")
data
       CategoryA CategoryB
Group1        20        10
Group2        15        25
# Perform chi-square test
result <- chisq.test(data)
# Print results
print(result)
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 4.725, df = 1, p-value = 0.02973
5.3 Dataset Context: GSSsubset.csv
The dataset is a subset of the General Social Survey (GSS), containing:
Categorical variables: sex, degree (education level), marital (marital status)
Numerical variables: id, age, income, height, weight, hrswrk
5.4 Loading and Inspecting Data in R
# Load required libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.4 ✔ tidyr 1.3.1
✔ readr 2.1.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ plotly::filter() masks dplyr::filter(), stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import dataset
gss <- read.csv("GSSsubset.csv")
# Inspect structure
str(gss)
'data.frame': 994 obs. of 9 variables:
$ id : int 1 2 4 14 16 19 21 27 28 30 ...
$ sex : chr "MALE" "FEMALE" "FEMALE" "FEMALE" ...
$ degree : chr "BACHELOR" "BACHELOR" "BACHELOR" "HIGH SCHOOL" ...
$ income : num 60968 60968 10161 17551 17551 ...
$ marital: chr "DIVORCED" "MARRIED" "MARRIED" "MARRIED" ...
$ age : int 53 26 56 40 56 51 30 35 57 54 ...
$ height : int 72 60 68 65 66 68 62 70 71 71 ...
$ weight : int 190 97 160 156 210 170 115 180 225 165 ...
$ hrswrk : int 60 40 20 37 6 50 38 40 40 40 ...
glimpse(gss)
Rows: 994
Columns: 9
$ id <int> 1, 2, 4, 14, 16, 19, 21, 27, 28, 30, 32, 38, 40, 44, 45, 46, 4…
$ sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "FEMALE"…
$ degree <chr> "BACHELOR", "BACHELOR", "BACHELOR", "HIGH SCHOOL", "HIGH SCHOO…
$ income <dbl> 60967.50, 60967.50, 10161.25, 17551.25, 17551.25, 15703.75, 17…
$ marital <chr> "DIVORCED", "MARRIED", "MARRIED", "MARRIED", "MARRIED", "MARRI…
$ age <int> 53, 26, 56, 40, 56, 51, 30, 35, 57, 54, 61, 31, 35, 26, 50, 43…
$ height <int> 72, 60, 68, 65, 66, 68, 62, 70, 71, 71, 64, 67, 69, 76, 67, 68…
$ weight <int> 190, 97, 160, 156, 210, 170, 115, 180, 225, 165, 128, 150, 200…
$ hrswrk <int> 60, 40, 20, 37, 6, 50, 38, 40, 40, 40, 40, 39, 50, 45, 60, 40,…
#summary(gss)
Key Checks
Identify variable types (numeric, factor)
Check missing values
# Check missing values
colSums(is.na(gss))
id sex degree income marital age height weight hrswrk
0 0 0 0 0 0 0 0 0
5.5 Hypothesis Testing Framework
5.5.1 General Structure
H₀ (Null): No effect / no difference
H₁ (Alternative): There is an effect / difference
5.5.2 Decision Rule
Reject H₀ if p-value ≤ 0.05
Otherwise, fail to reject H₀
5.6 Selecting the Right Test
| Scenario | Variables | Test |
|---|---|---|
| Mean vs constant | Numeric | One-sample t-test |
| Mean difference (2 groups) | Numeric + categorical | Two-sample t-test |
| Paired observations | Numeric (paired) | Paired t-test |
| Association | Categorical | Chi-square test |
| Correlation | Numeric + numeric | Correlation test |
5.7 Hands-On Analysis
5.7.1 One-Sample t-Test
Used to compare sample mean with a known value.
Research Question: Is the average age different from 40?
t.test(gss$age, mu = 40)
One Sample t-test
data: gss$age
t = 10.781, df = 993, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 40
95 percent confidence interval:
43.67189 45.30598
sample estimates:
mean of x
44.48893
Interpretation:
Check the p-value
If p < 0.05 → Mean age significantly differs from 40
Here p < 2.2e-16, so the mean age (44.49) differs significantly from 40.
5.7.2 Two-Sample t-Test
Compares means between two independent groups.
Research Question: Does income differ by gender?
t.test(income ~ sex, data = gss)
Welch Two Sample t-test
data: income by sex
t = -8.9504, df = 825.42, p-value < 2.2e-16
alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
95 percent confidence interval:
-22917.25 -14673.48
sample estimates:
mean in group FEMALE mean in group MALE
27300.45 46095.81
Key Output Components:
t-statistic (here, -8.9504)
Mean of each group
Confidence interval
p-value
Reporting Results (Standard Format)
Example:
A Welch two-sample t-test was conducted to compare income between females and males. The results showed a statistically significant difference (t = -8.9504, df = 825.42, p-value < 2.2e-16), with a mean income of 27300.45 for females and 46095.81 for males. Therefore, we reject the null hypothesis and conclude that mean income differs by gender.
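The numbers in such a report can be pulled from the test object rather than copied by hand. The sketch below uses R's built-in sleep dataset as a stand-in for gss, and the layout of the report string is only one possible format.

```r
# Welch two-sample t-test on the built-in sleep data (stand-in for gss)
res <- t.test(extra ~ group, data = sleep)

# An htest object is a list: statistic, parameter (df), p.value, conf.int, estimate
report <- sprintf(
  "t = %.2f, df = %.2f, p = %s, 95%% CI [%.2f, %.2f]",
  res$statistic,
  res$parameter,
  format.pval(res$p.value, digits = 3),
  res$conf.int[1],
  res$conf.int[2]
)

report
res$estimate   # group means, useful for describing the direction of the difference
```

Extracting the values programmatically avoids transcription errors when the analysis is rerun on updated data.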
5.7.3 Paired t-Test
Used when observations are dependent (e.g., before vs after).
Applicable only if the dataset has repeated measures on the same units, so the example below uses illustrative values.
before <- c(120, 130, 125, 140)
after <- c(115, 128, 120, 135)
t.test(before, after, paired = TRUE)
Paired t-test
data: before and after
t = 5.6667, df = 3, p-value = 0.01088
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
1.863165 6.636835
sample estimates:
mean difference
4.25
5.7.4 Chi-Square Test (Categorical Association)
Used for categorical data (association or independence).
# Contingency table
data <- matrix(c(10, 20, 30, 40), nrow = 2)
chisq.test(data)
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 0.44643, df = 1, p-value = 0.504
Research Question: Is education level associated with marital status?
table_data <- table(gss$degree, gss$marital)
chisq.test(table_data)
Warning in chisq.test(table_data): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: table_data
X-squared = 41.217, df = 16, p-value = 0.0005158
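Interpretation:
- p < 0.05 → the variables are associated (not independent)
- The warning indicates that some expected cell counts are small, so the chi-squared approximation may be unreliable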
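When the approximation warning appears, it helps to inspect the expected counts and consider a test that does not rely on the approximation. The sketch below uses a small hypothetical contingency table (small_table), not the GSS data.

```r
# Hypothetical 2 x 3 contingency table with some sparse cells
small_table <- matrix(c(12, 3, 5, 2, 8, 4), nrow = 2)

# suppressWarnings() silences the approximation warning for this demonstration
approx_test <- suppressWarnings(chisq.test(small_table))

# Expected counts under independence; values below 5 trigger the warning
approx_test$expected
any(approx_test$expected < 5)

# Option 1: Monte Carlo p-value instead of the chi-squared approximation
set.seed(42)
sim_test <- chisq.test(small_table, simulate.p.value = TRUE, B = 2000)
sim_test$p.value

# Option 2: Fisher's exact test
fisher.test(small_table)$p.value
```

Both alternatives are part of base R's stats package; which one is preferable depends on the table size and the study design.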
5.7.5 Correlation Test
Research Question: Is age correlated with income?
cor.test(gss$age, gss$income)
Pearson's product-moment correlation
data: gss$age and gss$income
t = 6.8048, df = 992, p-value = 1.748e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1509841 0.2698184
sample estimates:
cor
0.2111815
5.8 Assumption Checking
5.8.1 Normality (for t-tests)
shapiro.test(gss$age)
Shapiro-Wilk normality test
data: gss$age
W = 0.97874, p-value = 7.129e-11
Or visually:
hist(gss$age)
qqnorm(gss$age)
qqline(gss$age)
5.8.2 Equal Variance (F-test)
var.test(income ~ sex, data = gss)
F test to compare two variances
data: income by sex
F = 0.34694, num df = 486, denom df = 506, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2909261 0.4138759
sample estimates:
ratio of variances
0.3469425
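The F-test result matters because t.test() can be run with or without the equal-variance assumption. As a sketch on the built-in mtcars data (a stand-in for gss), the two versions differ in their degrees of freedom:

```r
# Welch test (the default): does not assume equal variances
welch_test <- t.test(mpg ~ am, data = mtcars)

# Pooled (Student) test: assumes equal variances in both groups
pooled_test <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The pooled test uses n1 + n2 - 2 degrees of freedom;
# the Welch test adjusts them downward when variances differ
c(welch_df  = unname(welch_test$parameter),
  pooled_df = unname(pooled_test$parameter))
```

Since the F-test above rejects equal variances for income by sex, the Welch version that t.test() uses by default is the safer choice for that comparison.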
5.8.3 Non-Parametric Alternative
wilcox.test(income ~ sex, data = gss)
Wilcoxon rank sum test with continuity correction
data: income by sex
W = 84528, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Practical Exercise (In-Class)
Task Set
- Test whether:
- Mean age differs from 35
- Income differs by education level
- Check association:
- Gender vs marital status
- Compute correlation:
- Age vs hrswrk
Homework Assignment
Part A: Hypothesis Development
Formulate 3 research questions from GSS dataset
Define H₀ and H₁ clearly
Part B: Implementation in R
Perform:
One t-test
One chi-square test
One correlation test
Part C: Interpretation
For each:
Test statistic
p-value
Decision
Real-world interpretation
Session 6: Correlation and Regression Analysis
6.1 Correlation analysis
Correlation measures the strength and direction of a linear relationship between two variables.
Range: From -1 to +1.
+1: Perfect positive relationship (both increase together).
-1: Perfect negative relationship (one increases while the other decreases).
0: No linear relationship.
Common Measures:
Pearson’s correlation coefficient (r): For continuous, normally distributed data.
Spearman’s rank correlation: For ranked/ordinal data.
Kendall’s tau: For ordinal data with ties.
Example: Hours studied vs. exam score → r = 0.915 indicates a strong positive correlation.
Hands-On Exercises
# Example: Pearson correlation
data <- data.frame(
x = c(1, 2, 3, 4, 5),
y = c(2, 4, 5, 7, 6))
head(data)
  x y
1 1 2
2 2 4
3 3 5
4 4 7
5 5 6
# Calculate Pearson correlation
pearson_corr <- cor(data$x, data$y, method = "pearson")
print(paste("Pearson Correlation:", pearson_corr))
[1] "Pearson Correlation: 0.904194430179465"
# Test significance of Pearson correlation
pearson_test <- cor.test(data$x, data$y, method = "pearson")
print(pearson_test)
Pearson's product-moment correlation
data: data$x and data$y
t = 3.6667, df = 3, p-value = 0.03508
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1084151 0.9937257
sample estimates:
cor
0.9041944
data <- data.frame(
x = c(1, 2, 3, 4, 5),
y = c(2, 4, 5, 7, 6))
# Example: Spearman correlation
spearman_corr <- cor(data$x, data$y, method = "spearman")
print(paste("Spearman Correlation:", spearman_corr))
[1] "Spearman Correlation: 0.9"
# Test significance of Spearman correlation
spearman_test <- cor.test(data$x, data$y, method = "spearman")
print(spearman_test)
Spearman's rank correlation rho
data: data$x and data$y
S = 2, p-value = 0.08333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.9
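Kendall's tau, the third measure listed earlier, is computed the same way by changing the method argument. This sketch reuses the same five data points. Nine of the ten pairs of observations are concordant and one is discordant, so tau = (9 − 1) / 10 = 0.8.

```r
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6))

# Kendall's tau compares the ordering of every pair of observations
kendall_corr <- cor(data$x, data$y, method = "kendall")
kendall_corr

# Test significance of Kendall's tau
kendall_test <- cor.test(data$x, data$y, method = "kendall")
print(kendall_test)
```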
6.2 Regression Analysis
Regression models the relationship between a dependent variable (response) and one or more independent variables (predictors).
Purpose: Explains and predicts how changes in X affect Y.
Types:
Simple Linear Regression: One predictor, equation form: Y=a+bX.
Multiple Linear Regression: Several predictors.
Logistic Regression: For categorical outcomes.
Polynomial Regression: Models nonlinear relationships.
Interpretation Example: Predicted exam score = 65.47 + 2.58 × (hours studied).
Intercept (65.47): Expected score with zero study hours.
Slope (2.58): Average score increase per extra study hour.
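The interpretation example can be turned into a small calculation. The function name predict_score is hypothetical; the coefficients are the ones quoted in the example above.

```r
intercept <- 65.47   # expected score with zero study hours
slope     <- 2.58    # average score increase per extra study hour

# Predicted exam score for a given number of study hours
predict_score <- function(hours) {
  intercept + slope * hours
}

# 0 hours returns the intercept; 4 hours gives 65.47 + 2.58 * 4 = 75.79
predict_score(c(0, 1, 4))
```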
Hands-On Exercises
Simple Linear Regression: Use lm() to fit a simple linear regression model.
# sample dataset
data <- data.frame(
x = c(1, 2, 3, 4, 5),
y = c(2, 4, 5, 7, 6))
# Example: Simple linear regression
model <- lm(y ~ x, data = data)
# Print summary of the model
summary(model)
Call:
lm(formula = y ~ x, data = data)
Residuals:
1 2 3 4 5
-0.6 0.3 0.2 1.1 -1.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.500 0.995 1.508 0.2288
x 1.100 0.300 3.667 0.0351 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9487 on 3 degrees of freedom
Multiple R-squared: 0.8176, Adjusted R-squared: 0.7568
F-statistic: 13.44 on 1 and 3 DF, p-value: 0.03508
### Extracting the Parameters from the Model
model$coefficients
(Intercept) x
1.5 1.1
summary(model)$r.squared
[1] 0.8175676
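Once a model is fitted, predict() applies the estimated equation to new values and confint() gives intervals for the coefficients. The sketch below refits the same simple model. The new value x = 6 is a hypothetical input that lies just outside the observed range, so the prediction is an extrapolation.

```r
data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 7, 6))
model <- lm(y ~ x, data = data)

# New observation to predict for (hypothetical value)
new_obs <- data.frame(x = 6)

# Point prediction: intercept + slope * x = 1.5 + 1.1 * 6 = 8.1
predicted_y <- predict(model, newdata = new_obs)
predicted_y

# Prediction interval for a single new observation
predict(model, newdata = new_obs, interval = "prediction")

# 95% confidence intervals for the intercept and slope
confint(model)
```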
Multiple Linear Regression: Extend the model to include multiple independent variables.
# Example: Multiple linear regression
data <- data.frame(
y = c(2, 4, 5, 4, 6),
x1 = c(1, 2, 3, 4, 5),
x2 = c(3, 5, 7, 6, 8))
# Fit the model
model <- lm(y ~ x1 + x2, data = data)
# Print summary of the model
summary(model)
Call:
lm(formula = y ~ x1 + x2, data = data)
Residuals:
1 2 3 4 5
-0.06667 0.33333 -0.26667 -0.20000 0.20000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4222 0.6748 -0.626 0.5954
x1 -0.1778 0.2703 -0.658 0.5784
x2 0.8889 0.2222 4.000 0.0572 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3651 on 2 degrees of freedom
Multiple R-squared: 0.9697, Adjusted R-squared: 0.9394
F-statistic: 32 on 2 and 2 DF, p-value: 0.0303
Interpreting Results: Understand coefficients, R-squared, and p-values.
# Example: Interpreting regression results
summary(model)$coefficients # View coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4222222 0.6747656 -0.6257317 0.59537818
x1 -0.1777778 0.2703450 -0.6575959 0.57836298
x2 0.8888889 0.2222222 4.0000000 0.05719096
summary(model)$r.squared # View R-squared value
[1] 0.969697
Model Diagnostics
# Example: Residual analysis: Plot residuals to check for patterns.
par(mfrow = c(1, 2)) # Set up a 1 x 2 plot layout (one row, two columns)
plot(model) # Generate diagnostic plots
# Example: Multicollinearity (VIF): Use the Variance Inflation Factor (VIF) to detect multicollinearity.
library(car) # Load the 'car' package (install it first if it is not already installed)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:purrr':
some
The following object is masked from 'package:dplyr':
recode
vif(model) # Calculate VIF values
x1 x2
5.481481 5.481481
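The VIF for a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. As a sketch in base R using the same data, the value for x1 can be reproduced without the car package. The names aux_model, r2_x1 and vif_x1 are introduced here for illustration.

```r
data <- data.frame(
  y  = c(2, 4, 5, 4, 6),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(3, 5, 7, 6, 8))

# Auxiliary regression: how well do the other predictors explain x1?
aux_model <- lm(x1 ~ x2, data = data)
r2_x1 <- summary(aux_model)$r.squared

# VIF grows as x1 becomes more predictable from the other predictors
vif_x1 <- 1 / (1 - r2_x1)
vif_x1
```

The result should match the value reported by vif(model) above. With only two predictors, both share the same VIF because the auxiliary R² is the squared correlation between x1 and x2 in either direction.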