——————————————————————————————–

Facilitator: CDAM Experts

——————————————————————————————–

Course Outline: Beginner Level R for Data Science

Session 1: Introduction to Data Science, R, RStudio, and Basic Data Types

Session 2: Data Import, Cleaning, and Exploratory Data Analysis (EDA)

Session 3: Data Manipulation with dplyr

Session 4: Data Visualization with ggplot2

Session 5: Probability Distributions and Random Variables

Session 6: Hypothesis Testing

Session 7: Regression Analysis, Correlation and Time Series

Session 8: Analysis of Variance (ANOVA) and Non-Parametric Tests

Session 9: Reporting with RMarkdown

Session 10: Capstone Project

Session 1: Introduction to Data Science, R, RStudio, and Basic Data Types

Learning Objectives:

By the end of this session, you will be able to:

1. What is Data Science?

Definition

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Aims of Data Science

  1. Data Exploration & Analysis – Discover patterns and trends.

  2. Predictive Modeling – Forecast future outcomes using machine learning.

  3. Decision-Making – Support business and scientific decisions with data.

  4. Automation – Build intelligent systems (e.g., recommendation engines).

  5. Visualization – Communicate insights effectively.

Applications of Data Science

Business: Customer segmentation, fraud detection, sales forecasting.

Healthcare: Disease prediction, drug discovery, medical imaging.

Finance: Risk assessment, algorithmic trading, credit scoring.

Marketing: Sentiment analysis, personalized recommendations.

Social Media: Trend analysis, user behavior modeling.

2. Overview of R and RStudio

What is R?

What is RStudio?

NB: Think of R as the engine and RStudio as the dashboard; together they make driving (coding) easier and more efficient.

3. Installing R and RStudio

Step-by-step Installation:

Step 1. Install R:

Step 2. Install RStudio:

Step 3. Launch RStudio:

  • After installation, open RStudio.

  • You’ll see multiple panes: Console, Script Editor, Environment, etc.

Step 4. RStudio Interface Overview

  • Console: Where you type commands and see immediate output.

  • Script Editor: Write and save R scripts here (File > New File > R Script).

  • Environment/History: Lists the objects (variables) you have created and your command history.

  • Files/Plots/Packages/Help: File browser, plot viewer, package manager, and help documentation.

Getting Started

Before you begin, you might want to create a new project in RStudio. A project is a self-contained working environment that helps you manage your work efficiently. It keeps your R scripts, datasets, outputs, and settings in one place. You can name the project and choose a directory to save it in.

Set a working directory: this is the default location where R looks for files and saves outputs.

setwd("~/2025_R_TRAINING") # It tells R where to look for files and where to save files
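The sketch below shows one way to check the working directory before and after changing it. The folder name is only an illustration and is created inside R's temporary directory so that the example runs on any machine; in practice you would point to your own project folder.

```r
# Show the current working directory and remember it
old_dir <- getwd()
print(old_dir)

# Build a path to an illustrative folder (hypothetical name) and create it if needed
training_dir <- file.path(tempdir(), "r_training_demo")
if (!dir.exists(training_dir)) {
  dir.create(training_dir)
}

# Switch to the new folder, confirm the change, then switch back
setwd(training_dir)
print(basename(getwd()))
## [1] "r_training_demo"
setwd(old_dir)
```

Using file.path() to build paths avoids typing the path separator by hand, which differs between Windows and other systems.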

4. Basic Operations and Functions

Arithmetic Operators

x <- 10
y <- 3

print(x + y)   # Addition
## [1] 13
print(x - y)   # Subtraction
## [1] 7
print(x * y)   # Multiplication
## [1] 30
print(x / y)   # Division
## [1] 3.333333
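R has a few more arithmetic operators that are often useful. A short sketch, reusing the same values of x and y:

```r
x <- 10
y <- 3

print(x^y)              # Exponentiation
## [1] 1000
print(x %% y)           # Modulus (remainder after division)
## [1] 1
print(x %/% y)          # Integer division
## [1] 3
print(round(x / y, 2))  # Division rounded to 2 decimal places
## [1] 3.33
```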

Logical Operators

a <- 5
b <- 10

print(a > b)           # FALSE
## [1] FALSE
print(a == 5 & b > 5)  # AND
## [1] TRUE
print(a == 5 | b < 5)  # OR
## [1] TRUE
print(a != 5)          # a is not equal to 5
## [1] FALSE
print(a == 5)          # a is equal to 5
## [1] TRUE
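The same operators also work on whole vectors, comparing one element at a time. A small sketch, assuming an illustrative vector called scores:

```r
scores <- c(45, 72, 90)

print(scores > 50)                # Compare each element with 50
## [1] FALSE  TRUE  TRUE
print(scores > 50 & scores < 80)  # Element-wise AND
## [1] FALSE  TRUE FALSE
print(!(scores > 50))             # NOT reverses each result
## [1]  TRUE FALSE FALSE
print(any(scores > 80))           # Is at least one value above 80?
## [1] TRUE
print(all(scores > 50))           # Are all values above 50?
## [1] FALSE
```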

Built-in Functions

sum(c(1, 2, 3))         # Sum
## [1] 6
mean(c(2, 4, 6))        # Mean
## [1] 4
sd(c(2, 4, 6))          # Standard deviation
## [1] 2
min(c(10, 20, 5))       # Minimum
## [1] 5
max(c(10, 20, 5))       # Maximum
## [1] 20
length(c(1, 2, 3))      # Length
## [1] 3
seq(1, 10, by = 2)      # Sequence
## [1] 1 3 5 7 9
rep("hello", times = 3) # Repeat
## [1] "hello" "hello" "hello"
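Built-in functions can be combined to reproduce a calculation step by step. As a sketch, the sample standard deviation returned by sd() can be rebuilt from length(), mean(), sum(), and sqrt(); the variable names are illustrative.

```r
values <- c(2, 4, 6)

n <- length(values)                      # Number of observations
deviations <- values - mean(values)      # Distance of each value from the mean
variance <- sum(deviations^2) / (n - 1)  # Sample variance
manual_sd <- sqrt(variance)              # Standard deviation is its square root

print(manual_sd)
## [1] 2
print(isTRUE(all.equal(manual_sd, sd(values))))  # Same result as sd()
## [1] TRUE
```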

5. Basic R Syntax and Data Types:

R has several basic data structures:

1. Vectors

A vector is the simplest data structure in R. It contains elements of the same type.

# Create a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)

# Create a character vector
char_vector <- c("apple", "banana", "cherry")

# Create a logical vector
logical_vector <- c(TRUE, FALSE, TRUE)

# Print vector
print(numeric_vector)
## [1] 1 2 3 4 5
print(char_vector)
## [1] "apple"  "banana" "cherry"
print(logical_vector)
## [1]  TRUE FALSE  TRUE

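Attention: All elements of a vector must be of the same type. If types are mixed, R coerces every element to the most flexible type present (logical < numeric < character).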

mixed_vector <- c(1, "two", TRUE)
print(mixed_vector)  # All converted to characters
## [1] "1"    "two"  "TRUE"
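A few more coercion cases, sketched with illustrative values, show how R picks the common type and how to convert back explicitly:

```r
# Logical and numeric: TRUE becomes 1 and FALSE becomes 0
num_from_logical <- c(1, TRUE, FALSE)
print(num_from_logical)
## [1] 1 1 0
typeof(num_from_logical)
## [1] "double"

# Numeric and character: numbers become text
typeof(c(1, "two"))
## [1] "character"

# Explicit conversion from text back to numbers
as.numeric(c("10", "20"))
## [1] 10 20
```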

Useful Functions for Vectors

# Create a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)

# Create a character vector
char_vector <- c("apple", "banana", "cherry")

# Create a logical vector
logical_vector <- c(TRUE, FALSE, TRUE)

length(numeric_vector)       # Length of the vector
## [1] 5
typeof(char_vector)       # Type of elements
## [1] "character"
is.vector(logical_vector)    # Check if object is a vector
## [1] TRUE
vec_num <- c(1, 2, 3)
vec_char <- c("apple", "banana")
vec_logical <- c(TRUE, FALSE, TRUE)

length(vec_num)       # Length of the vector
## [1] 3
typeof(vec_num)       # Type of elements
## [1] "double"
is.vector(vec_num)    # Check if object is a vector
## [1] TRUE
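Elements of a vector are selected with square brackets, and arithmetic is applied to every element at once. A minimal sketch with an illustrative vector:

```r
prices <- c(10, 20, 30, 40, 50)

prices[2]            # Second element (R counts from 1)
## [1] 20
prices[c(1, 5)]      # First and fifth elements
## [1] 10 50
prices[prices > 25]  # Elements that satisfy a condition
## [1] 30 40 50
prices * 2           # Arithmetic applies to every element
## [1]  20  40  60  80 100
```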

2. Matrices

A matrix is a 2D vector with rows and columns. All elements must be of the same type. By default, matrix() fills the values column by column.

# Create a matrix from a vector
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(mat)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# Access elements
print(mat[1, 2])  # First row, second column
## [1] 3
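To fill a matrix row by row instead, matrix() accepts the byrow argument. A short sketch comparing the two layouts (object names are illustrative):

```r
# Default: values fill the matrix column by column
mat_by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values fill the matrix row by row
mat_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(mat_by_row)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
print(mat_by_row[1, 2])  # First row, second column
## [1] 2
```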

Naming Rows and Columns

rownames(mat) <- c("Row1", "Row2")
colnames(mat) <- c("Col1", "Col2", "Col3")
print(mat)
##      Col1 Col2 Col3
## Row1    1    3    5
## Row2    2    4    6

Indexing in Matrices

mat[1, 2]     # First row, second column
## [1] 3
mat[2, ]      # Entire second row
## Col1 Col2 Col3 
##    2    4    6
mat[, 3]      # Entire third column
## Row1 Row2 
##    5    6
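Indexing can also select several columns at once, keep a single row as a matrix, or replace a value. A sketch on a small unnamed matrix m (an illustrative object):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

dim(m)                # Number of rows and columns
## [1] 2 3
m[, c(1, 3)]          # First and third columns
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
m[1, , drop = FALSE]  # First row, kept as a 1 x 3 matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
m[2, 3] <- 60         # Replace a single element
print(m[2, 3])
## [1] 60
```

Without drop = FALSE, selecting a single row or column returns a plain vector rather than a matrix.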

3. Lists

A list can contain elements of different types and even other lists.

my_list <- list(
  name = "John", 
  age = 30, 
  grades = c(85, 90, 78))


print(my_list)
## $name
## [1] "John"
## 
## $age
## [1] 30
## 
## $grades
## [1] 85 90 78
# Accessing List Elements
my_list$name           # By name
## [1] "John"
my_list[[3]]           # By index
## [1] 85 90 78
my_list$grades[2]      # Second grade
## [1] 90
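Single and double brackets behave differently on lists: single brackets return a smaller list, while double brackets return the element itself. A sketch with an illustrative list named person:

```r
person <- list(name = "John", age = 30, grades = c(85, 90, 78))

class(person["grades"])    # Single brackets return a smaller list
## [1] "list"
class(person[["grades"]])  # Double brackets return the element itself
## [1] "numeric"
mean(person$grades)        # Average of the grades
## [1] 84.33333
person$passed <- TRUE      # Add a new element (illustrative)
length(person)             # The list now has four elements
## [1] 4
```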

4. Data Frames – Tabular Data Structure

A data frame is like a spreadsheet or SQL table: rows represent observations and columns represent variables.

Creating a Data Frame

# Create a data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Ann"),
  Age = c(25, 30, 35, 26),
  Salary = c(50000, 60000, 70000, 40000))

print(df)
##      Name Age Salary
## 1   Alice  25  50000
## 2     Bob  30  60000
## 3 Charlie  35  70000
## 4     Ann  26  40000

Inspecting a Data Frame

str(df)         # Structure of the data frame
## 'data.frame':    4 obs. of  3 variables:
##  $ Name  : chr  "Alice" "Bob" "Charlie" "Ann"
##  $ Age   : num  25 30 35 26
##  $ Salary: num  50000 60000 70000 40000
summary(df)     # Summary statistics
##      Name                Age            Salary     
##  Length:4           Min.   :25.00   Min.   :40000  
##  Class :character   1st Qu.:25.75   1st Qu.:47500  
##  Mode  :character   Median :28.00   Median :55000  
##                     Mean   :29.00   Mean   :55000  
##                     3rd Qu.:31.25   3rd Qu.:62500  
##                     Max.   :35.00   Max.   :70000
head(df)        # First few rows
##      Name Age Salary
## 1   Alice  25  50000
## 2     Bob  30  60000
## 3 Charlie  35  70000
## 4     Ann  26  40000
dim(df)         # Dimensions (rows x columns)
## [1] 4 3
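Individual columns and cells can be pulled out of a data frame for further calculation. A minimal sketch, using an illustrative data frame called staff and assuming R 4.0 or later, where text columns are stored as character by default:

```r
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Ann"),
  Age = c(25, 30, 35, 26),
  Salary = c(50000, 60000, 70000, 40000))

nrow(staff)         # Number of rows
## [1] 4
staff$Age           # One column as a vector
## [1] 25 30 35 26
mean(staff$Salary)  # Average salary
## [1] 55000
staff[2, "Name"]    # Row 2, column Name
## [1] "Bob"
```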

Adding and Removing Columns

# Add a new column
df$Department <- c("HR", "Finance", "IT", "Audit")

# Remove a column (R is case-sensitive: the column is Salary, not salary)
df$Salary <- NULL
print(df)
##      Name Age Department
## 1   Alice  25         HR
## 2     Bob  30    Finance
## 3 Charlie  35         IT
## 4     Ann  26      Audit

Filtering Rows

# Filter rows where Age > 30
filtered_df <- subset(df, Age > 30)
print(filtered_df)
##      Name Age Department
## 3 Charlie  35         IT
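Rows can also be filtered with square brackets, and subset() can combine several conditions and keep only chosen columns. A sketch on a separate illustrative data frame called team:

```r
team <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Ann"),
  Age = c(25, 30, 35, 26),
  Salary = c(50000, 60000, 70000, 40000))

# Same idea as subset(): keep rows where the condition is TRUE
older <- team[team$Age > 30, ]
print(older)
##      Name Age Salary
## 3 Charlie  35  70000

# Combine two conditions and keep only selected columns
young_earners <- subset(team, Age < 30 & Salary >= 50000,
                        select = c(Name, Salary))
print(young_earners)
##    Name Salary
## 1 Alice  50000
```

Note the comma in team[team$Age > 30, ]: the condition goes before it to select rows, and leaving the part after it empty keeps all columns.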

6. Hands-On Practice

Task 1: Create and Manipulate Vectors

# Create two vectors
vec1 <- c(10, 20, 30)
vec2 <- c("red", "green", "blue")

# Concatenate them
combined_vec <- c(vec1, vec2)
print(combined_vec)
## [1] "10"    "20"    "30"    "red"   "green" "blue"
# Find the length
print(length(combined_vec))
## [1] 6
# Coerce numeric to character
as.character(vec1)
## [1] "10" "20" "30"

Exercise 1: Working with Vectors

# Create two numeric vectors
vec1 <- c(10, 20, 30)
vec2 <- c(40, 50, 60)

# Concatenate them
combined_vec <- c(vec1, vec2)
print(combined_vec)
## [1] 10 20 30 40 50 60
# Convert to character and print
as.character(combined_vec)
## [1] "10" "20" "30" "40" "50" "60"

Task 2: Matrix Creation and Indexing

# Create a 3x3 matrix
mat <- matrix(seq(1, 9), nrow = 3, ncol = 3)
print(mat)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# Extract element at position (2,3)
print(mat[2, 3])
## [1] 8

Exercise 2: Matrix Practice

# Create a 3x3 matrix with values from 1 to 9
mat <- matrix(seq(1, 9), nrow = 3, ncol = 3)
print(mat)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# Extract diagonal elements
diag(mat)
## [1] 1 5 9
# Transpose the matrix
t(mat)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Task 3: Working with a Data Frame

# Create a small dataset
students <- data.frame(
  ID = c(101, 102, 103),
  Name = c("Emma", "Liam", "Olivia"),
  Score = c(88, 92, 85)
)

# Add a new column
students$Grade <- c("B", "A", "B")
print(students)
##    ID   Name Score Grade
## 1 101   Emma    88     B
## 2 102   Liam    92     A
## 3 103 Olivia    85     B
# Filter students with score > 90
high_scores <- subset(students, Score > 90)
print(high_scores)
##    ID Name Score Grade
## 2 102 Liam    92     A

Exercise 3: Exploring Data Frames

# Create a sample employee dataset
employees <- data.frame(
  ID = c(101, 102, 103),
  Name = c("Emma", "Liam", "Olivia"),
  Department = c("HR", "IT", "Marketing"),
  Salary = c(55000, 65000, 60000))

# Add a new column indicating whether salary is above $60,000
employees$HighEarner <- employees$Salary > 60000
print(employees)
##    ID   Name Department Salary HighEarner
## 1 101   Emma         HR  55000      FALSE
## 2 102   Liam         IT  65000       TRUE
## 3 103 Olivia  Marketing  60000      FALSE
# Filter employees who earn more than $60,000
high_earners <- subset(employees, Salary > 60000)
print(high_earners)
##    ID Name Department Salary HighEarner
## 2 102 Liam         IT  65000       TRUE

7. Homework Assignment One

Exercise 1: Vector Practice

  • Create a numeric vector with values from 1 to 20.

  • Calculate the sum and mean.

  • Convert it to a character vector and print the result.

Exercise 2: Matrix Challenge

  • Create a 4x4 matrix filled with numbers from 1 to 16.

  • Extract the diagonal elements.

  • Transpose the matrix.

Exercise 3: Data Frame Exploration

  • Create a data frame representing 5 employees with fields: Name, Department, Salary.

  • Compute the average salary.

  • Add a column indicating whether salary is above $60,000 (TRUE/FALSE).

8. Homework Assignment Two

Exercise A: Vector Mastery

  • Create a numeric vector containing numbers from 1 to 20.

  • Compute the sum and average.

  • Convert it to a character vector and print the result.

Exercise B: Matrix Challenge

  • Create a 4x4 matrix with numbers from 1 to 16.

  • Extract the diagonal, compute the transpose, and extract the last row.

  • Replace the last row with zeros.

Exercise C: Data Frame Analysis

  • Create a data frame representing 5 students with fields: Name, mathematics, Score.

  • Compute the average Score.

  • Add a column indicating whether Score is above 50 (use TRUE/FALSE).

  • Sort the students by Score in descending order.
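Sorting has not been shown earlier in this session, so here is a hedged hint: one common base R approach uses order(), which returns the row positions in sorted order. The data below are made up for illustration and are not the homework answer.

```r
# Illustrative data (hypothetical names and values)
results <- data.frame(
  Name = c("Ada", "Ben", "Cy"),
  Score = c(48, 75, 62))

# order() returns row positions; decreasing = TRUE sorts from highest to lowest
sorted_results <- results[order(results$Score, decreasing = TRUE), ]
print(sorted_results)
##   Name Score
## 2  Ben    75
## 3   Cy    62
## 1  Ada    48
```

The row labels on the left keep their original numbers, which shows how the rows were rearranged.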

Additional Resources