Session 1: Introduction to Data Science, R, RStudio, and Basic Data Types
Session 2: Data Import, Cleaning, and Exploratory Data Analysis (EDA)
Session 3: Data Manipulation with dplyr
Session 4: Data Visualization with ggplot2
Session 5: Probability Distributions and Random Variables
Session 6: Hypothesis Testing
Session 7: Regression Analysis, Correlation and Time Series
Session 8: Analysis of Variance (ANOVA) and Non-Parametric Tests
Session 9: Reporting with RMarkdown
Session 10: Capstone Project
By the end of this session, you will be able to:
Understand what Data Science is, its aims, and its applications.
Understand what R and RStudio are and how they differ.
Install and configure R and RStudio on your system.
Navigate the RStudio interface (Console, Script Editor, Environment, Plots).
Use basic R syntax and understand core data types: vectors, matrices, lists, and data frames.
Perform basic operations and write simple R scripts.
Prepare for working with real-world datasets in upcoming sessions.
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data Exploration & Analysis – Discover patterns and trends.
Predictive Modeling – Forecast future outcomes using machine learning.
Decision-Making – Support business and scientific decisions with data.
Automation – Build intelligent systems (e.g., recommendation engines).
Visualization – Communicate insights effectively.
• Business: Customer segmentation, fraud detection, sales forecasting.
• Healthcare: Disease prediction, drug discovery, medical imaging.
• Finance: Risk assessment, algorithmic trading, credit scoring.
• Marketing: Sentiment analysis, personalized recommendations.
• Social Media: Trend analysis, user behavior modeling.
R is a programming language and software environment for statistical computing, data analysis, and graphics.
Originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland.
Open-source, community-driven, and widely used in academia, research, and industry.
Key strengths:
Rich ecosystem of packages (e.g., dplyr, ggplot2, caret)
Built-in support for statistical models
Strong community and active development
RStudio is an Integrated Development Environment (IDE) for working with R.
Provides a more user-friendly interface with tools for writing code, visualizing data, managing files, and debugging.
Available as RStudio Desktop (runs locally) or RStudio Server (runs on a remote machine and is accessed through a web browser).
NB: Think of R as the engine and RStudio as the dashboard: together they make driving (coding) easier and more efficient.
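Install R: download and install the version appropriate for your OS (Windows, macOS, Linux) from CRAN (https://cran.r-project.org).
Install RStudio: download and install the free desktop version from Posit (https://posit.co/download/rstudio-desktop/).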
After installation, open RStudio.
You’ll see multiple panes: Console, Script Editor, Environment, etc.
Console: Where you type commands and see immediate output.
Script Editor: Where you write and save R scripts (File > New File > R Script).
Environment/History: Lists the objects (variables) you have created and your command history.
Files/Plots/Packages/Help: File browser, plot viewer, package manager, and help documentation.
Before you begin, you might want to create a new project in RStudio. A project is a self-contained working environment that helps you manage your work efficiently. It keeps your R scripts, datasets, outputs, and settings in one place. You can name the project and choose the directory in which to save it.
## [1] 13
## [1] 7
## [1] 30
## [1] 3.333333
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] 6
## [1] 4
## [1] 2
## [1] 5
## [1] 20
## [1] 3
## [1] 1 3 5 7 9
## [1] "hello" "hello" "hello"
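The output lines above appear without the commands that produced them. As a minimal sketch, assuming the operands were 10 and 3 (an assumption; the original commands are not shown), the following reproduces the arithmetic and comparison results at the start of that output, as well as the final sequence and repetition lines. The variable names `a`, `b`, `odds`, and `greeting` are illustrative only.

```r
# Assumed operands; the original values are not shown in the text
a <- 10
b <- 3

# Arithmetic operators
a + b   # addition
a - b   # subtraction
a * b   # multiplication
a / b   # division

# Comparison operators return logical values (TRUE or FALSE)
a == b  # equal to
a != b  # not equal to
a > b   # greater than
a < b   # less than
a >= b  # greater than or equal to

# Generating a sequence and repeating a value
odds <- seq(1, 9, by = 2)     # from 1 to 9 in steps of 2
greeting <- rep("hello", 3)   # the same string three times
print(odds)
print(greeting)
```

Typing each line in the Console shows its result immediately; in a script, select the lines and run them to see the same output.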
R has several basic data structures:
A vector is the simplest data structure in R. It contains elements of the same type.
# Create a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
# Create a character vector
char_vector <- c("apple", "banana", "cherry")
# Create a logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
# Print each vector
print(numeric_vector)
print(char_vector)
print(logical_vector)
## [1] 1 2 3 4 5
## [1] "apple" "banana" "cherry"
## [1] TRUE FALSE TRUE
Attention: All elements of a vector must be the same type; if types are mixed, R coerces every element to a single common type.
## [1] "1" "two" "TRUE"
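The coercion rule can be sketched as follows. The first vector is an assumption chosen to match the output above (the original command is not shown); the names `mixed` and `num_logical` are illustrative.

```r
# Mixing types in c() coerces everything to the most flexible type present
mixed <- c(1, "two", TRUE)
print(mixed)     # every element is now a character string
typeof(mixed)

# Without any strings, logical values are coerced to numbers instead
num_logical <- c(1, TRUE, FALSE)
print(num_logical)
typeof(num_logical)
```

Roughly, the order of flexibility is logical, then numeric, then character, so a single string in a vector turns every element into a string.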
Useful Functions for Vectors
# Create a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
# Create a character vector
char_vector <- c("apple", "banana", "cherry")
# Create a logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
length(numeric_vector) # Length of the vector
## [1] 5
## [1] "character"
## [1] TRUE
vec_num <- c(1, 2, 3)
vec_char <- c("apple", "banana")
vec_logical <- c(TRUE, FALSE, TRUE)
length(vec_num) # Length of the vector
## [1] 3
## [1] "double"
## [1] TRUE
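Only the `length()` calls are shown above; the commands behind the remaining output lines are missing. A minimal sketch, assuming they were `typeof()` and an `is.*()` type check (an assumption), together with two other commonly used helpers:

```r
vec_num <- c(1, 2, 3)
vec_char <- c("apple", "banana")
vec_logical <- c(TRUE, FALSE, TRUE)

length(vec_num)          # number of elements
typeof(vec_num)          # storage type; ordinary numbers are "double"
class(vec_char)          # class of the object
is.logical(vec_logical)  # type check, returns TRUE or FALSE
sum(vec_num)             # sum of the elements
mean(vec_num)            # arithmetic mean
```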
A matrix is a 2D vector with rows and columns. All elements must be of the same type.
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [1] 3
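The command that built the matrix above is not shown. A minimal sketch that produces the same 2 x 3 layout, assuming the values 1 to 6 were used (the name `mat_by_row` is illustrative):

```r
# Fill a 2 x 3 matrix with the values 1 to 6, column by column (the default)
mat <- matrix(1:6, nrow = 2, ncol = 3)
print(mat)

dim(mat)    # number of rows and columns
ncol(mat)   # number of columns

# byrow = TRUE fills the matrix row by row instead
mat_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(mat_by_row)
```

Because R fills matrices column by column unless told otherwise, the first row of `mat` is 1, 3, 5 rather than 1, 2, 3.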
Naming Rows and Columns
## Col1 Col2 Col3
## Row1 1 3 5
## Row2 2 4 6
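One way to attach the row and column names shown above is with `rownames()` and `colnames()`. This is a sketch; the original command is not shown.

```r
mat <- matrix(1:6, nrow = 2, ncol = 3)

# Assign names to the rows and columns
rownames(mat) <- c("Row1", "Row2")
colnames(mat) <- c("Col1", "Col2", "Col3")
print(mat)
```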
Indexing in Matrices
## [1] 3
## Col1 Col2 Col3
## 2 4 6
## Row1 Row2
## 5 6
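The three results above are consistent with the following indexing calls, shown here as a sketch on the same named matrix (the original commands are not shown). Matrix indexing has the form `mat[row, column]`; leaving a position empty selects everything along that dimension.

```r
mat <- matrix(1:6, nrow = 2, ncol = 3,
              dimnames = list(c("Row1", "Row2"), c("Col1", "Col2", "Col3")))

mat[1, 2]            # single element: row 1, column 2
mat[2, ]             # entire second row, returned as a named vector
mat[, 3]             # entire third column, returned as a named vector
mat["Row1", "Col2"]  # the same element as mat[1, 2], selected by name
```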
A list can contain elements of different types and even other lists.
## $name
## [1] "John"
##
## $age
## [1] 30
##
## $grades
## [1] 85 90 78
## [1] "John"
## [1] 85 90 78
## [1] 90
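A list with the contents shown above could be created and accessed as follows. This is a sketch: the variable name `person` is hypothetical, and the original commands are not shown.

```r
# A list can hold a string, a number, and a vector side by side
person <- list(name = "John", age = 30, grades = c(85, 90, 78))
print(person)

person$name          # access an element by name with $
person[["grades"]]   # access an element by name with [[ ]]
person$grades[2]     # second value inside the grades vector
```

Single brackets, as in `person["name"]`, return a smaller list rather than the element itself, which is why `[[ ]]` or `$` is usually what you want.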
A data frame is like a spreadsheet or SQL table — rows represent observations, columns represent variables.
# Create a data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Ann"),
  Age = c(25, 30, 35, 26),
  Salary = c(50000, 60000, 70000, 40000)
)
print(df)
## Name Age Salary
## 1 Alice 25 50000
## 2 Bob 30 60000
## 3 Charlie 35 70000
## 4 Ann 26 40000
## 'data.frame': 4 obs. of 3 variables:
## $ Name : chr "Alice" "Bob" "Charlie" "Ann"
## $ Age : num 25 30 35 26
## $ Salary: num 50000 60000 70000 40000
## Name Age Salary
## Length:4 Min. :25.00 Min. :40000
## Class :character 1st Qu.:25.75 1st Qu.:47500
## Mode :character Median :28.00 Median :55000
## Mean :29.00 Mean :55000
## 3rd Qu.:31.25 3rd Qu.:62500
## Max. :35.00 Max. :70000
## Name Age Salary
## 1 Alice 25 50000
## 2 Bob 30 60000
## 3 Charlie 35 70000
## 4 Ann 26 40000
## [1] 4 3
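The four blocks of output after `print(df)` have the shape produced by `str()`, `summary()`, `head()`, and `dim()`. A sketch of those inspection calls on the same data frame, assuming these were the functions used:

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Ann"),
  Age = c(25, 30, 35, 26),
  Salary = c(50000, 60000, 70000, 40000)
)

str(df)       # structure: column names, types, and first values
summary(df)   # summary statistics for each column
head(df)      # first rows (up to six by default)
dim(df)       # number of rows and columns
```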
# Add a new column
df$Department <- c("HR", "Finance", "IT", "Audit")
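# Remove a column by assigning NULL to it. Column names are case-sensitive:
# `salary` does not match `Salary`, so this line leaves df unchanged.
# Use df$Salary <- NULL to actually drop the Salary column.
df$salary <- NULL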
print(df)
## Name Age Salary Department
## 1 Alice 25 50000 HR
## 2 Bob 30 60000 Finance
## 3 Charlie 35 70000 IT
## 4 Ann 26 40000 Audit
## Name Age Salary Department
## 3 Charlie 35 70000 IT
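The single-row result above comes from filtering the data frame, but the condition is not shown. A minimal sketch using logical indexing, assuming the condition was `Age > 30` (an assumption; `Salary > 60000` would return the same row). The names `older` and `older_names` are illustrative.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Ann"),
  Age = c(25, 30, 35, 26),
  Salary = c(50000, 60000, 70000, 40000),
  Department = c("HR", "Finance", "IT", "Audit")
)

# Keep the rows where the condition is TRUE; the empty column slot keeps all columns
older <- df[df$Age > 30, ]
print(older)

# Keep only selected columns for the matching rows
older_names <- df[df$Age > 30, c("Name", "Department")]
print(older_names)
```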
Task 1: Create and Manipulate Vectors
# Create two vectors
vec1 <- c(10, 20, 30)
vec2 <- c("red", "green", "blue")
# Concatenate them
combined_vec <- c(vec1, vec2)
print(combined_vec)
## [1] "10" "20" "30" "red" "green" "blue"
## [1] 6
## [1] "10" "20" "30"
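The last two output lines of this task are consistent with taking the length of the combined vector and selecting its first three elements. A sketch, assuming those were the commands:

```r
vec1 <- c(10, 20, 30)
vec2 <- c("red", "green", "blue")
combined_vec <- c(vec1, vec2)   # the numbers are coerced to character

length(combined_vec)   # total number of elements
combined_vec[1:3]      # first three elements, now stored as strings
class(combined_vec)    # confirms the vector is character
```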
Exercise 1: Working with Vectors
# Create two numeric vectors
vec1 <- c(10, 20, 30)
vec2 <- c(40, 50, 60)
# Concatenate them
combined_vec <- c(vec1, vec2)
print(combined_vec)
## [1] 10 20 30 40 50 60
## [1] "10" "20" "30" "40" "50" "60"
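The quoted values in the second output line suggest an explicit conversion with `as.character()`. A sketch of that step, plus the reverse conversion (the names `char_version` and `num_version` are illustrative):

```r
vec1 <- c(10, 20, 30)
vec2 <- c(40, 50, 60)
combined_vec <- c(vec1, vec2)

# Explicit conversion from numeric to character
char_version <- as.character(combined_vec)
print(char_version)

# Converting back restores numbers that can be used in arithmetic
num_version <- as.numeric(char_version)
sum(num_version)
```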
Task 2: Matrix Creation and Indexing
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [1] 8
Exercise 2: Matrix Practice
# Create a 3x3 matrix with values from 1 to 9
mat <- matrix(seq(1, 9), nrow = 3, ncol = 3)
print(mat)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [1] 1 5 9
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
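The two results after the printed matrix match `diag()` and `t()`. A sketch, assuming those were the functions used:

```r
mat <- matrix(seq(1, 9), nrow = 3, ncol = 3)

diag(mat)   # elements on the main diagonal
t(mat)      # transpose: rows become columns
```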
Task 3: Working with a Data Frame
# Create a small dataset
students <- data.frame(
  ID = c(101, 102, 103),
  Name = c("Emma", "Liam", "Olivia"),
  Score = c(88, 92, 85)
)
# Add a new column
students$Grade <- c("B", "A", "B")
print(students)
## ID Name Score Grade
## 1 101 Emma 88 B
## 2 102 Liam 92 A
## 3 103 Olivia 85 B
## ID Name Score Grade
## 2 102 Liam 92 A
Exercise 3: Exploring Data Frames
# Create a sample employee dataset
employees <- data.frame(
  ID = c(101, 102, 103),
  Name = c("Emma", "Liam", "Olivia"),
  Department = c("HR", "IT", "Marketing"),
  Salary = c(55000, 65000, 60000)
)
# Add a new column indicating whether salary is above $60,000
employees$HighEarner <- employees$Salary > 60000
print(employees)
## ID Name Department Salary HighEarner
## 1 101 Emma HR 55000 FALSE
## 2 102 Liam IT 65000 TRUE
## 3 103 Olivia Marketing 60000 FALSE
# Filter employees who earn more than $60,000
high_earners <- subset(employees, Salary > 60000)
print(high_earners)
## ID Name Department Salary HighEarner
## 2 102 Liam IT 65000 TRUE
Create a numeric vector with values from 1 to 20.
Calculate the sum and mean.
Convert it to a character vector and print the result.
Create a 4x4 matrix filled with numbers from 1 to 16.
Extract the diagonal elements.
Transpose the matrix.
Create a data frame representing 5 employees with fields: Name, Department, Salary.
Compute the average salary.
Add a column indicating whether salary is above $60,000 (TRUE/FALSE).
Create a numeric vector containing numbers from 1 to 20.
Compute the sum and average.
Convert it to a character vector and print the result.
Create a 4x4 matrix with numbers from 1 to 16.
Extract the diagonal, transpose, and last row.
Replace the last row with zeros.
Create a data frame representing 5 students with fields: Name, mathematics, Score.
Compute the average Score.
Add a column indicating whether Score is above 50 (use TRUE/FALSE).
Sort the students by Score in descending order.
Official R Documentation: https://cran.r-project.org/manuals.html
RStudio Cheatsheets: https://posit.co/download/rstudio-cheatsheets
R for Data Science Book: https://r4ds.hadley.nz/
Try R Playground: https://try.rbind.io/