1. INTRODUCTION TO R
1.1 What is R?
R is statistical software and a programming language used for:
- Data entry
- Data analysis
- Graphs and charts
- Summary statistics
- Reporting results
It is widely used in:
- Research
- Business
- Economics
- Health sciences
- Social sciences
1.2 Why use R for Statistics Practical?
R helps students to:
Analyze real data
Produce tables and graphs
Compute statistics quickly
Understand statistical concepts practically
Build reproducible analyses
1.3 Installing R and RStudio
Students should install:
- R: download from https://cran.r-project.org
- RStudio: download from https://posit.co/download/rstudio-desktop/
RStudio Interface
RStudio is where you write and run code. Its main panes are:
- Source (Scripts): Write and save R code here.
- Console: Execute commands interactively.
- Environment: View variables, data frames, and functions in memory.
- Plots: Display graphical outputs.
A package is a collection of functions designed to perform specific tasks.
For this course, two important packages are:
1. ggplot2
ggplot2 is used for drawing graphs and charts in a clear and attractive way.
It helps students create:
- Bar charts
- Histograms
- Boxplots
- Scatter plots
- Line graphs
2. dplyr
dplyr is used for data manipulation and summarization.
It helps students to:
- Select variables
- Filter rows
- Create new variables
- Arrange data
- Summarize data easily
- Install packages only once per machine:
# install.packages("ggplot2")
# install.packages("dplyr")
- Load packages at the start of every new session:
# library(ggplot2)
# library(dplyr)
2. BASIC R TRAINING FOR BEGINNERS
2.1 R as a calculator
R can perform simple arithmetic.
a = 2 + 3
b = 10 - 4
c = 6 * 5
d = 20 / 4
e = 2^3
f = sqrt(49)
2.2 Assigning values to objects
In R, we store values in objects using `<-`.
x <- 10
y <- 5
x + y
[1] 15
x * y
[1] 50
You can also use `=`, but `<-` is preferred.
z = 7
z
[1] 7
2.3 Creating vectors
A vector is an ordered collection of values of the same type.
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)
scores
[1] 56 67 45 80 72 61 59 90
Useful commands:
length(scores)
[1] 8
sum(scores)
[1] 530
mean(scores)
[1] 66.25
min(scores)
[1] 45
max(scores)
[1] 90
sort(scores)
[1] 45 56 59 61 67 72 80 90
2.4 Types of data in R
R handles different kinds of data:
Numeric data
age <- c(18, 19, 20, 21)
Character data
names <- c("Ama", "Kojo", "Esi", "Yaw")
Logical data
passed <- c(TRUE, FALSE, TRUE, TRUE)
Factor (categorical data)
gender <- factor(c("Male", "Female", "Female", "Male"))
gender
[1] Male   Female Female Male
Levels: Female Male
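To check which type R has assigned to an object, `class()` can be used, and `levels()` lists the categories of a factor. The sketch below also shows one way to build an ordered factor. The `satisfaction` vector and its level order are illustrative assumptions, not part of the data above.

```r
# Inspect the type of each kind of vector
age <- c(18, 19, 20, 21)
passed <- c(TRUE, FALSE, TRUE, TRUE)
gender <- factor(c("Male", "Female", "Female", "Male"))

class(age)      # "numeric"
class(passed)   # "logical"
class(gender)   # "factor"
levels(gender)  # "Female" "Male" (alphabetical by default)

# Hypothetical ordered factor: the levels are listed in a meaningful order
satisfaction <- factor(c("High", "Low", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)
table(satisfaction)
```

Listing the levels explicitly keeps tables and charts in the intended order rather than alphabetical order.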
2.5 Creating a data frame
A data frame is like a spreadsheet table.
students <- data.frame(
  Name = c("Ama", "Kojo", "Esi", "Yaw", "Akosua"),
  Age = c(19, 20, 18, 21, 19),
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Score = c(78, 65, 88, 70, 90)
)
students
    Name Age Gender Score
1    Ama  19 Female    78
2   Kojo  20   Male    65
3    Esi  18 Female    88
4    Yaw  21   Male    70
5 Akosua  19 Female    90
Check the structure:
str(students)
'data.frame': 5 obs. of 4 variables:
 $ Name  : chr "Ama" "Kojo" "Esi" "Yaw" ...
 $ Age   : num 19 20 18 21 19
 $ Gender: chr "Female" "Male" "Female" "Male" ...
 $ Score : num 78 65 88 70 90
summary(students)
     Name                Age          Gender              Score
 Length:5           Min.   :18.0   Length:5           Min.   :65.0
 Class :character   1st Qu.:19.0   Class :character   1st Qu.:70.0
 Mode  :character   Median :19.0   Mode  :character   Median :78.0
                    Mean   :19.4                      Mean   :78.2
                    3rd Qu.:20.0                      3rd Qu.:88.0
                    Max.   :21.0                      Max.   :90.0
2.6 Viewing parts of data
students$Score
[1] 78 65 88 70 90
students$Name
[1] "Ama"    "Kojo"   "Esi"    "Yaw"    "Akosua"
students[1, ]      # first row
  Name Age Gender Score
1  Ama  19 Female    78
students[, 2]      # second column
[1] 19 20 18 21 19
students[1:3, ]    # first three rows
  Name Age Gender Score
1  Ama  19 Female    78
2 Kojo  20   Male    65
3  Esi  18 Female    88
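Rows can also be picked by a condition rather than by position. The following is a minimal sketch using the same `students` data; the cut-off of 75 and the object names `high` and `females` are arbitrary choices made for illustration.

```r
students <- data.frame(
  Name = c("Ama", "Kojo", "Esi", "Yaw", "Akosua"),
  Age = c(19, 20, 18, 21, 19),
  Gender = c("Female", "Male", "Female", "Male", "Female"),
  Score = c(78, 65, 88, 70, 90)
)

# Rows where the score is above 75 (an arbitrary example cut-off)
high <- students[students$Score > 75, ]
high$Name        # "Ama" "Esi" "Akosua"

# The same idea with subset(), keeping only two columns
females <- subset(students, Gender == "Female", select = c(Name, Score))
nrow(females)    # 3
```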
3. DATA COLLECTION AND SURVEYS
3.1 Meaning of data
Data are facts, figures, or observations collected for analysis.
Examples:
Age of students
Test scores
Gender
Height
Income
Marital status
3.2 Sources of data
- Primary data: collected directly by the researcher
- Secondary data: obtained from existing sources
3.3 Methods of data collection
- Questionnaires
- Interviews
- Observation
- Experiments
- Administrative records
3.4 Surveys
A survey is a method of collecting information from individuals.
Types of surveys
- Census
- Sample survey
Sampling methods
- Simple random sampling
- Systematic sampling
- Stratified sampling
- Cluster sampling
- Convenience sampling
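Two of the sampling methods listed above can be tried directly in R. This is a rough sketch only: the population of 100 ID numbers, the sample size of 10 and the seed value are all assumptions chosen for illustration.

```r
# Hypothetical sampling frame: ID numbers for a population of 100 students
population <- 1:100

# Simple random sampling: every ID has the same chance of selection
set.seed(123)                        # makes the random draw repeatable
srs <- sample(population, size = 10)

# Systematic sampling: every k-th ID after a random start
k <- length(population) / 10         # sampling interval, here 10
start <- sample(1:k, 1)              # random start between 1 and k
systematic <- seq(from = start, by = k, length.out = 10)

length(srs)         # 10 IDs, drawn without replacement
diff(systematic)    # every gap equals k
```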
3.5 Entering survey data into R
Example survey data:
survey <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male"),
  Age = c(20, 19, 21, 22, 20, 18, 23, 19, 20, 21),
  StudyHours = c(3, 4, 5, 2, 6, 4, 3, 5, 4, 2),
  Satisfaction = c("High", "Medium", "High", "Low", "High", "Medium", "Low", "High", "Medium", "Low")
)
survey
   Gender Age StudyHours Satisfaction
1    Male  20          3         High
2  Female  19          4       Medium
3  Female  21          5         High
4    Male  22          2          Low
5    Male  20          6         High
6  Female  18          4       Medium
7    Male  23          3          Low
8  Female  19          5         High
9  Female  20          4       Medium
10   Male  21          2          Low
Check summary:
summary(survey)
    Gender               Age          StudyHours   Satisfaction
 Length:10          Min.   :18.00   Min.   :2.00   Length:10
 Class :character   1st Qu.:19.25   1st Qu.:3.00   Class :character
 Mode  :character   Median :20.00   Median :4.00   Mode  :character
                    Mean   :20.30   Mean   :3.80
                    3rd Qu.:21.00   3rd Qu.:4.75
                    Max.   :23.00   Max.   :6.00
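Because `Gender` and `Satisfaction` are stored as text, `summary()` reports only their length and class. One option, sketched below, is to convert them to factors so that `summary()` shows category counts. The level order Low, Medium, High is an assumption made here for illustration.

```r
survey <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Male",
             "Female", "Male", "Female", "Female", "Male"),
  Satisfaction = c("High", "Medium", "High", "Low", "High",
                   "Medium", "Low", "High", "Medium", "Low")
)

# Convert the character columns to factors
survey$Gender <- factor(survey$Gender)
survey$Satisfaction <- factor(survey$Satisfaction,
                              levels = c("Low", "Medium", "High"))

summary(survey$Gender)        # Female 5, Male 5
summary(survey$Satisfaction)  # Low 3, Medium 3, High 4
```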
4. DATA PRESENTATION
4.1 Types of data
1. Qualitative (Categorical) data
These describe qualities or categories.
Examples:
Gender
Religion
Marital status
Blood group
2. Quantitative (Numerical) data
These are numbers.
(a) Discrete data
Countable values
Examples:
Number of children
Number of cars
Number of students
(b) Continuous data
Measured values
Examples:
Height
Weight
Temperature
Time
4.2 Frequency distributions and tabulation
Example 1: Frequency table for categorical data
gender <- c("Male", "Female", "Female", "Male", "Male", "Female", "Male", "Female")
table(gender)
gender
Female   Male
     4      4
Relative frequency:
prop.table(table(gender))
gender
Female   Male
   0.5    0.5
Percentage frequency:
prop.table(table(gender)) * 100
gender
Female   Male
    50     50
Example 2: Frequency table for numerical data
scores <- c(45, 50, 55, 60, 60, 65, 70, 70, 70, 75, 80, 85, 90)
table(scores)
scores
45 50 55 60 65 70 75 80 85 90
 1  1  1  2  1  3  1  1  1  1
Grouped frequency distribution
Suppose we want class intervals.
marks <- c(34, 45, 56, 67, 78, 89, 90, 43, 55, 61, 73, 84, 39, 48, 52)
classes <- cut(marks,
               breaks = c(30, 40, 50, 60, 70, 80, 90),
               right = FALSE)
table(classes)
classes
[30,40) [40,50) [50,60) [60,70) [70,80) [80,90)
      2       3       3       2       2       2
Note: with right = FALSE each interval excludes its upper limit, so the mark 90 falls outside [80,90) and is not counted. The frequencies therefore sum to 14, not 15.
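If every mark should be counted, including a value equal to the top break, one option is the `include.lowest = TRUE` argument of `cut()`, which closes the last interval when `right = FALSE`. The sketch below compares the two calls; `classes2` is a name introduced here for illustration.

```r
marks <- c(34, 45, 56, 67, 78, 89, 90, 43, 55, 61, 73, 84, 39, 48, 52)

# Call as in the text: 90 falls outside [80,90) and becomes NA
classes <- cut(marks, breaks = seq(30, 90, by = 10), right = FALSE)
sum(is.na(classes))    # 1 mark is left unclassified

# include.lowest = TRUE closes the last interval, giving [80,90]
classes2 <- cut(marks, breaks = seq(30, 90, by = 10),
                right = FALSE, include.lowest = TRUE)
table(classes2)
sum(table(classes2))   # 15, so every mark is counted
```

Extending the breaks to 100 would be another way to achieve the same thing, at the cost of adding a class that holds a single value.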
4.3 Cross-tabulation
Used to summarize two categorical variables.
table(survey$Gender, survey$Satisfaction)
         High Low Medium
  Female    2   0      3
  Male      2   3      0
Overall proportions:
prop.table(table(survey$Gender, survey$Satisfaction))
         High Low Medium
  Female  0.2 0.0    0.3
  Male    0.2 0.3    0.0
Row proportions:
prop.table(table(survey$Gender, survey$Satisfaction), 1)
         High Low Medium
  Female  0.4 0.0    0.6
  Male    0.4 0.6    0.0
Column proportions:
prop.table(table(survey$Gender, survey$Satisfaction), 2)
         High Low Medium
  Female  0.5 0.0    1.0
  Male    0.5 1.0    0.0
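Row and column totals are often wanted alongside a cross-tabulation. A minimal sketch using base R's `addmargins()` follows; the object name `tab` is introduced here for illustration.

```r
survey <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Male",
             "Female", "Male", "Female", "Female", "Male"),
  Satisfaction = c("High", "Medium", "High", "Low", "High",
                   "Medium", "Low", "High", "Medium", "Low")
)

tab <- table(survey$Gender, survey$Satisfaction)

# Add a "Sum" row and column holding the totals
addmargins(tab)

# Row percentages, rounded to one decimal place
round(prop.table(tab, 1) * 100, 1)
```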
5. GRAPHICAL REPRESENTATION OF DATA
5.1 Bar chart
Suitable for categorical data.
gender_tab <- table(survey$Gender)
barplot(gender_tab,
        main = "Bar Chart of Gender",
        xlab = "Gender",
        ylab = "Frequency",
        col = c("skyblue", "pink"))
5.2 Pie chart
pie(gender_tab,
    main = "Pie Chart of Gender",
    col = c("skyblue", "pink"))
5.3 Histogram
Suitable for continuous numerical data.
hist(survey$Age,
     main = "Histogram of Age",
     xlab = "Age",
     col = "lightgreen",
     border = "black")
5.4 Frequency polygon
hist(survey$Age, plot = FALSE)
$breaks
[1] 18 19 20 21 22 23
$counts
[1] 3 3 2 1 1
$density
[1] 0.3 0.3 0.2 0.1 0.1
$mids
[1] 18.5 19.5 20.5 21.5 22.5
$xname
[1] "survey$Age"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
h <- hist(survey$Age, plot = FALSE)
plot(h$mids, h$counts, type = "b",
     main = "Frequency Polygon of Age",
     xlab = "Age",
     ylab = "Frequency",
     col = "blue")
5.5 Boxplot
Useful for showing spread and outliers.
boxplot(survey$StudyHours,
        main = "Boxplot of Study Hours",
        ylab = "Hours",
        col = "orange")
Compare groups:
boxplot(Score ~ Gender, data = students,
        main = "Scores by Gender",
        xlab = "Gender",
        ylab = "Score",
        col = c("pink", "lightblue"))
5.6 Stem-and-leaf plot
stem(scores)
  The decimal point is 1 digit(s) to the right of the |
  4 | 5
  5 | 05
  6 | 005
  7 | 0005
  8 | 05
  9 | 0
5.7 Scatter plot
For two numerical variables.
plot(survey$Age, survey$StudyHours,
     main = "Scatter Plot of Age and Study Hours",
     xlab = "Age",
     ylab = "Study Hours",
     pch = 19,
     col = "red")
6. PRINCIPLES OF EFFECTIVE DATA VISUALIZATION
Students should learn the following:
- Give every graph a clear title
- Label axes properly
- Use appropriate graph for the data type
- Avoid too many colors
- Keep graphs simple and readable
- Show units where necessary
- Avoid misleading scales
- Use legends when needed
Example of a well-labeled plot
hist(marks,
     main = "Distribution of Students' Marks",
     xlab = "Marks",
     ylab = "Frequency",
     col = "lightblue",
     border = "black")
7. MEASURES OF CENTRAL TENDENCY
Measures of central tendency describe the center of the data.
7.1 Mean
The arithmetic average.
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)
mean(scores)
[1] 66.25
7.2 Median
The middle value after arranging the data in order. With an even number of values, it is the average of the two middle values.
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)
median(scores)
[1] 64
7.3 Mode
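R has no built-in function for the statistical mode (the built-in `mode()` reports an object's storage type instead), but we can define one.
get_mode <- function(x) {
  ux <- unique(x)                          # distinct values
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}
data1 <- c(2, 4, 4, 5, 6, 7, 4, 8)
get_mode(data1)
[1] 4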
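When two or more values tie for the highest frequency, `get_mode()` returns only the first of them. A possible variant that returns every tied value is sketched below; `get_modes` is a hypothetical helper name, not a base R function.

```r
# Returns every value that shares the highest frequency
get_modes <- function(x) {
  counts <- table(x)
  modes <- names(counts)[counts == max(counts)]
  # table() stores the values as text, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

get_modes(c(2, 4, 4, 5, 6, 7, 4, 8))   # 4
get_modes(c(1, 1, 2, 2, 3))            # 1 2 (two modes)
```

If all values are different, every value is returned, which signals that the data have no meaningful mode.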
7.4 Comparing mean, median, and mode
data2 <- c(10, 12, 14, 15, 15, 16, 18, 100)
mean(data2)
[1] 25
median(data2)
[1] 15
get_mode(data2)
[1] 15
The outlier 100 pulls the mean up to 25, while the median and mode stay at 15.
8. MEASURES OF DISPERSION
These describe how spread out the data are.
8.1 Range
range(scores)
[1] 45 90
max(scores) - min(scores)
[1] 45
8.2 Variance
var(scores)
[1] 203.3571
8.3 Standard deviation
sd(scores)
[1] 14.26033
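`var()` and `sd()` compute the sample versions, which divide by n - 1 rather than n. A short sketch reproducing both results step by step follows; the intermediate object names are introduced here for illustration.

```r
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)

n <- length(scores)
deviations <- scores - mean(scores)          # distance of each score from the mean
sample_var <- sum(deviations^2) / (n - 1)    # sample variance
sample_sd <- sqrt(sample_var)                # sample standard deviation

all.equal(sample_var, var(scores))   # TRUE
all.equal(sample_sd, sd(scores))     # TRUE
```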
8.4 Interquartile range (IQR)
IQR(scores)
[1] 15.75
8.5 Five-number summary
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)
fivenum(scores)
[1] 45.0 57.5 64.0 76.0 90.0
summary(scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  45.00   58.25   64.00   66.25   74.00   90.00
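The two outputs above disagree on the quartiles (57.5 and 76 versus 58.25 and 74) because `fivenum()` reports hinges, the medians of the lower and upper halves, while `summary()` uses the default interpolation method of `quantile()`. A brief sketch showing both follows:

```r
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)

# Quartiles as used by summary() and IQR()
quantile(scores, probs = c(0.25, 0.75))   # 58.25 and 74

# Hinges as used by fivenum()
fivenum(scores)[c(2, 4)]                  # 57.5 and 76
```

Both are accepted conventions; the differences usually shrink as the sample gets larger.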
9. SHAPE OF DISTRIBUTIONS
The shape of a distribution can be:
- symmetric
- positively skewed (right-skewed)
- negatively skewed (left-skewed)
9.1 Visual inspection using histogram
hist(data2,
     main = "Histogram of Data2",
     xlab = "Values",
     col = "purple")
Because of the value 100, the distribution is positively skewed.
9.2 Skewness idea using mean and median
- If mean > median, distribution may be right-skewed
- If mean < median, distribution may be left-skewed
- If mean ≈ median, distribution may be symmetric
Example:
mean(data2)
[1] 25
median(data2)
[1] 15
Here the mean (25) is well above the median (15), which suggests right skew.
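The mean versus median comparison can be backed up with a numerical skewness coefficient. Base R does not provide one, so the sketch below defines a simple moment-based version; the function name `skewness` and this particular definition are assumptions for illustration, and other definitions exist.

```r
# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 3/2
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

data2 <- c(10, 12, 14, 15, 15, 16, 18, 100)
skewness(data2)            # positive, consistent with right skew
skewness(c(1, 2, 3, 4, 5)) # 0 for a perfectly symmetric set
```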
10. APPLICATIONS IN SUMMARIZING REAL DATA
We now analyze a small realistic dataset.
class_data <- data.frame(
  Student = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  Gender = c("Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female", "Female", "Male"),
  Age = c(18, 19, 18, 20, 21, 19, 20, 18, 19, 21),
  Score = c(65, 78, 82, 55, 91, 73, 68, 84, 79, 60)
)
class_data
   Student Gender Age Score
1        A   Male  18    65
2        B Female  19    78
3        C Female  18    82
4        D   Male  20    55
5        E Female  21    91
6        F   Male  19    73
7        G   Male  20    68
8        H Female  18    84
9        I Female  19    79
10       J   Male  21    60
10.1 Summary statistics
summary(class_data)
   Student             Gender               Age            Score
 Length:10          Length:10          Min.   :18.00   Min.   :55.00
 Class :character   Class :character   1st Qu.:18.25   1st Qu.:65.75
 Mode  :character   Mode  :character   Median :19.00   Median :75.50
                                       Mean   :19.30   Mean   :73.50
                                       3rd Qu.:20.00   3rd Qu.:81.25
                                       Max.   :21.00   Max.   :91.00
mean(class_data$Score)
[1] 73.5
median(class_data$Score)
[1] 75.5
sd(class_data$Score)
[1] 11.38469
var(class_data$Score)
[1] 129.6111
IQR(class_data$Score)
[1] 15.5
10.2 Frequency table for gender
table(class_data$Gender)
Female   Male
     5      5
prop.table(table(class_data$Gender)) * 100
Female   Male
    50     50
10.3 Graphs
Bar chart of gender
barplot(table(class_data$Gender),
        main = "Gender Distribution",
        xlab = "Gender",
        ylab = "Frequency",
        col = c("pink", "lightblue"))
Histogram of scores
hist(class_data$Score,
     main = "Distribution of Scores",
     xlab = "Score",
     col = "gold",
     border = "black")
Boxplot of scores
boxplot(class_data$Score,
        main = "Boxplot of Scores",
        ylab = "Score",
        col = "cyan")
10.4 Group comparison
Compare scores by gender:
aggregate(Score ~ Gender, data = class_data, mean)
  Gender Score
1 Female  82.8
2   Male  64.2
aggregate(Score ~ Gender, data = class_data, median)
  Gender Score
1 Female    82
2   Male    65
aggregate(Score ~ Gender, data = class_data, sd)
  Gender    Score
1 Female 5.167204
2   Male 6.978539
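The three `aggregate()` calls can also be condensed. A possible alternative using base R's `tapply()` and `split()` is sketched below; only the two columns needed are recreated, and the object names `group_means` and `group_stats` are introduced here for illustration.

```r
class_data <- data.frame(
  Gender = c("Male", "Female", "Female", "Male", "Female",
             "Male", "Male", "Female", "Female", "Male"),
  Score = c(65, 78, 82, 55, 91, 73, 68, 84, 79, 60)
)

# Mean score for each gender
group_means <- tapply(class_data$Score, class_data$Gender, mean)
group_means   # Female 82.8, Male 64.2

# Mean, median and standard deviation for each gender in one table
group_stats <- sapply(split(class_data$Score, class_data$Gender),
                      function(s) c(mean = mean(s), median = median(s), sd = sd(s)))
round(group_stats, 2)
```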
Boxplot by gender:
boxplot(Score ~ Gender, data = class_data,
        main = "Scores by Gender",
        xlab = "Gender",
        ylab = "Score",
        col = c("pink", "lightblue"))
11. MISSING VALUES IN R
Sometimes data contain missing values written as `NA`.
x <- c(12, 15, NA, 20, 18)
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 16.25
sum(x, na.rm = TRUE)
[1] 65
Important: use `na.rm = TRUE` when the data contain missing values and the statistic should be computed from the observed values only.
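Before dropping missing values, it helps to know how many there are and where they sit. A small sketch using `is.na()` follows; `x_complete` is a name introduced here for illustration.

```r
x <- c(12, 15, NA, 20, 18)

is.na(x)          # FALSE FALSE TRUE FALSE FALSE
sum(is.na(x))     # 1 missing value
which(is.na(x))   # it is in position 3

# Keep only the observed values
x_complete <- x[!is.na(x)]
length(x_complete)   # 4
mean(x_complete)     # 16.25, same as mean(x, na.rm = TRUE)
```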
12. SIMPLE CLASS PRACTICAL EXERCISES
Exercise 1: Create a vector
Create a vector containing the ages: 18, 19, 20, 18, 21, 22, 19, 20
Tasks:
- find the mean
- find the median
- find the range
- find the standard deviation
ages <- c(18, 19, 20, 18, 21, 22, 19, 20)
mean(ages)
[1] 19.625
median(ages)
[1] 19.5
range(ages)
[1] 18 22
sd(ages)
[1] 1.407886
Exercise 2: Frequency table
Use the following data on preferred drink:
Tea, Coffee, Tea, Juice, Coffee, Tea, Juice, Tea
drink <- c("Tea", "Coffee", "Tea", "Juice", "Coffee", "Tea", "Juice", "Tea")
table(drink)
drink
Coffee  Juice    Tea
     2      2      4
prop.table(table(drink)) * 100
drink
Coffee  Juice    Tea
    25     25     50
barplot(table(drink), col = c("brown", "orange", "green"))
pie(table(drink), col = c("brown", "orange", "green"))
Exercise 3: Histogram and boxplot
heights <- c(150, 155, 160, 162, 158, 170, 172, 168, 165, 159)
hist(heights,
     main = "Histogram of Heights",
     xlab = "Height (cm)",
     col = "lightblue")
boxplot(heights,
        main = "Boxplot of Heights",
        ylab = "Height (cm)",
        col = "lightgreen")
Exercise 4: Data frame practice
mydata <- data.frame(
  Name = c("John", "Mary", "Peter", "Linda"),
  Age = c(20, 21, 19, 22),
  Score = c(75, 88, 67, 90)
)
mydata
   Name Age Score
1  John  20    75
2  Mary  21    88
3 Peter  19    67
4 Linda  22    90
summary(mydata)
     Name                Age            Score
 Length:4           Min.   :19.00   Min.   :67.0
 Class :character   1st Qu.:19.75   1st Qu.:73.0
 Mode  :character   Median :20.50   Median :81.5
                    Mean   :20.50   Mean   :80.0
                    3rd Qu.:21.25   3rd Qu.:88.5
                    Max.   :22.00   Max.   :90.0
mean(mydata$Score)
[1] 80
13. INTRODUCTION TO SOME USEFUL R COMMANDS
help(mean)
?mean
ls()              # list objects in memory
 [1] "a"          "age"        "ages"       "b"          "c"
 [6] "class_data" "classes"    "d"          "data1"      "data2"
[11] "drink"      "e"          "f"          "gender"     "gender_tab"
[16] "get_mode"   "h"          "heights"    "marks"      "mydata"
[21] "names"      "passed"     "scores"     "students"   "survey"
[26] "x"          "y"          "z"
rm(x)             # remove object x
rm(list = ls())   # remove all objects
Get working directory:
# getwd()
Set working directory:
# setwd("C:/Users/YourName/Documents")
14. IMPORTING DATA FROM CSV
If students have data in Excel, save it as CSV first.
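To practise importing without an existing file, a small data frame can be written to a temporary CSV file and read back. This is only a sketch: the `demo` data, the temporary path and the object name `imported` are all illustrative assumptions.

```r
# Write a small illustrative data frame to a temporary CSV file
demo <- data.frame(Name = c("Ama", "Kojo"), Score = c(78, 65))
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)   # row.names = FALSE avoids an extra index column

# Read it back and inspect it
imported <- read.csv(path)
head(imported)
str(imported)
```

With a real file, the path would be the file name in the working directory or its full location on disk.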
# data <- read.csv("students.csv")
# head(data)
# str(data)
# summary(data)
15. SIMPLE TEACHING FLOW FOR BEGINNERS
A practical training can follow this order:
Week 1: Introduction to R
- opening RStudio
- arithmetic operations
- assigning objects
- creating vectors
Week 2: Data types and data frames
- numeric and categorical data
- factors
- data frames
- summary()
Week 3: Frequency tables and tabulation
- table()
- prop.table()
- cross-tabulation
Week 4: Graphs
- bar charts
- pie charts
- histograms
- boxplots
- scatter plots
Week 5: Descriptive statistics
- mean
- median
- mode
- range
- variance
- standard deviation
- IQR
Week 6: Real data applications
- importing CSV files
- summarizing data
- interpreting results
16. SAMPLE INTERPRETATION OF RESULTS
Suppose:
scores <- c(56, 67, 45, 80, 72, 61, 59, 90)
mean(scores)
[1] 66.25
median(scores)
[1] 64
sd(scores)
[1] 14.26033
Possible interpretation:
- The mean score represents the average performance of students.
- The median score gives the middle score and is less affected by extreme values.
- The standard deviation shows how much student scores vary around the mean.
- A small standard deviation means scores are close together.
- A large standard deviation means scores are widely spread.
17. VERY IMPORTANT BEGINNER TIPS
- R is case-sensitive: `score` and `Score` are different.
- Always use parentheses correctly
- Check spelling of variable names
- Use `str()` and `summary()` to inspect data
- Save your script regularly
- Comment your code with `#`
Example:
# Calculate mean score
mean(scores)
[1] 66.25
18. A COMPLETE BEGINNER EXAMPLE IN R
Students can copy and run this full example:
# Create dataset
students <- data.frame(
  Name = c("Ama", "Kojo", "Esi", "Yaw", "Akosua", "Kofi", "Abena", "Kwame"),
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"),
  Age = c(19, 20, 18, 21, 19, 22, 20, 21),
  Score = c(78, 65, 88, 70, 90, 55, 84, 73)
)
# View data
students
    Name Gender Age Score
1    Ama Female  19    78
2   Kojo   Male  20    65
3    Esi Female  18    88
4    Yaw   Male  21    70
5 Akosua Female  19    90
6   Kofi   Male  22    55
7  Abena Female  20    84
8  Kwame   Male  21    73
# Structure and summary
str(students)
'data.frame': 8 obs. of 4 variables:
 $ Name  : chr "Ama" "Kojo" "Esi" "Yaw" ...
 $ Gender: chr "Female" "Male" "Female" "Male" ...
 $ Age   : num 19 20 18 21 19 22 20 21
 $ Score : num 78 65 88 70 90 55 84 73
summary(students)
     Name              Gender               Age         Score
 Length:8           Length:8           Min.   :18   Min.   :55.00
 Class :character   Class :character   1st Qu.:19   1st Qu.:68.75
 Mode  :character   Mode  :character   Median :20   Median :75.50
                                       Mean   :20   Mean   :75.38
                                       3rd Qu.:21   3rd Qu.:85.00
                                       Max.   :22   Max.   :90.00
# Frequency table for gender
table(students$Gender)
Female   Male
     4      4
prop.table(table(students$Gender)) * 100
Female   Male
    50     50
# Measures of central tendency for scores
mean(students$Score)
[1] 75.375
median(students$Score)
[1] 75.5
# Mode function
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# All eight scores are different, so there is no real mode here;
# get_mode() simply returns the first value
get_mode(students$Score)
[1] 78
# Measures of dispersion
range(students$Score)
[1] 55 90
sd(students$Score)
[1] 12.02304
var(students$Score)
[1] 144.5536
IQR(students$Score)
[1] 16.25
# Graphs
barplot(table(students$Gender),
        main = "Gender Distribution",
        col = c("pink", "lightblue"))
hist(students$Score,
     main = "Histogram of Scores",
     xlab = "Score",
     col = "lightgreen")
boxplot(students$Score,
        main = "Boxplot of Scores",
        ylab = "Score",
        col = "orange")
boxplot(Score ~ Gender, data = students,
        main = "Scores by Gender",
        xlab = "Gender",
        ylab = "Score",
        col = c("pink", "lightblue"))
19. SUGGESTED PRACTICAL QUESTIONS FOR LEARNERS
- Enter a dataset of 10 students with variables:
  - Name
  - Age
  - Gender
  - Test score
- Produce:
  - frequency table for gender
  - bar chart for gender
  - histogram for test scores
  - boxplot for test scores
- Calculate:
  - mean
  - median
  - mode
  - range
  - variance
  - standard deviation
  - IQR
- Interpret the results.
20. CONCLUSION
Using R in introductory statistics practicals helps students move from theory to practice. With simple commands, they can:
- Organize data
- Summarize data
- Visualize data
- Interpret statistical measures
For beginners, start with:
- Vectors
- Data frames
- Tables
- Charts
- Mean, median, and standard deviation