How to use this notebook in class
- Read the text.
- Run code chunks one-by-one (click the green ▶ button).
- Answer the quick tasks where you see Task boxes.
- At the end, click Knit to produce a clean HTML report.
You need:
If you are already in the computer lab and can open RStudio, you’re good.
Run the next chunk. If it prints your R version, you are ready.
R.version.string
## [1] "R version 4.5.2 (2025-10-31 ucrt)"
When you open RStudio, you usually see these panes:
The working directory is the folder where R looks for files by default (CSV, images, saved outputs).
Check your current working directory:
getwd()
## [1] "C:/Users/uSer/OneDrive/Documents/SEMESTER 1.2"
Use an RStudio Project so your work is organized, for example a project named SDS1201.
After that:
- Put your data files inside that project folder
- Your working directory becomes stable
If you must set it manually:
setwd("C:/Users/YourName/Documents/SDS1201")
Note: In class, we prefer Projects over setwd().
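A minimal sketch of how relative paths depend on the working directory. The folder name `data` and the file name `students.csv` are placeholders chosen for illustration, not files this notebook requires.

```r
# A relative path is resolved against the working directory.
# "data" and "students.csv" are hypothetical names used for illustration.
rel_path <- file.path("data", "students.csv")

# The same location written as a full path from the current working directory.
abs_path <- file.path(getwd(), rel_path)

rel_path
abs_path
```

Inside a Project, `getwd()` is the project folder, so the short relative path keeps working even if the project is moved to another computer.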
You only install a package once (like installing an app).
install.packages("tidyverse")
install.packages("readr")
Each time you open RStudio, you load packages using library().
# If tidyverse is installed, this should load without errors:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
If you see an error like “there is no package called …”, you must install it first.
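One way to check for a package before loading it is sketched below. It assumes you only want a message telling you what to do; it does not install anything by itself.

```r
# requireNamespace() returns TRUE or FALSE instead of stopping with an error.
pkg <- "tidyverse"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message(pkg, " is installed. Load it with library(", pkg, ").")
} else {
  message(pkg, " is missing. Run install.packages(\"", pkg, "\") once.")
}
```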
An R Markdown file is a document that combines:
- text explanations
- code chunks
- output (tables, plots)
A code chunk looks like this:
```{r}
2 + 2
```
```
## [1] 4
```
Try it:
2 + 2
## [1] 4
Click Knit (top of RStudio) to generate an HTML report.
A variable stores a value.
x <- 10
y <- 3
x + y
## [1] 13
A vector is a 1D collection of values of the same type.
scores <- c(65, 70, 80, 55)
scores
## [1] 65 70 80 55
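Because a vector holds a single type, mixing numbers and text changes the whole vector. A small sketch (the values are made up):

```r
# Mixing numbers and text inside c() converts every element to character.
mixed <- c(65, "seventy", 80)
class(mixed)

# as.numeric() converts back; text that is not a number becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
nums <- suppressWarnings(as.numeric(mixed))
nums
```

Here `class(mixed)` is `"character"`, and the second element of `nums` is `NA` because "seventy" cannot be read as a number.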
Common operations:
length(scores)
## [1] 4
mean(scores)
## [1] 67.5
max(scores)
## [1] 80
Indexing:
scores[1] # first item
## [1] 65
scores[2:4] # items 2 to 4
## [1] 70 80 55
scores[scores >= 70] # conditional selection
## [1] 70 80
A data frame is a table (rows and columns).
students <- data.frame(
name = c("Amina", "Brian", "Carol"),
age = c(20, 21, 19),
score = c(78, 62, 85)
)
students
Access columns:
students$score
## [1] 78 62 85
mean(students$score)
## [1] 75
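Beyond `$`, square brackets select rows and columns. The sketch below rebuilds the same `students` table so it runs on its own; the pass mark of 50 is an assumption for illustration, not a course rule.

```r
students <- data.frame(
  name  = c("Amina", "Brian", "Carol"),
  age   = c(20, 21, 19),
  score = c(78, 62, 85)
)

# Rows where score is at least 70; the empty slot after the comma keeps all columns.
high <- students[students$score >= 70, ]
high

# Add a new logical column (50 is an assumed pass mark).
students$passed <- students$score >= 50
students
```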
We read CSV files with readr::read_csv(). It is fast and gives clean output.
# We'll create a small CSV in memory for practice
tmp_file <- tempfile(fileext = ".csv")
writeLines(
c("name,age,score",
"Amina,20,78",
"Brian,21,62",
"Carol,19,85"),
tmp_file
)
df <- readr::read_csv(tmp_file, show_col_types = FALSE)
df
glimpse(df)
## Rows: 3
## Columns: 3
## $ name <chr> "Amina", "Brian", "Carol"
## $ age <dbl> 20, 21, 19
## $ score <dbl> 78, 62, 85
summary(df)
## name age score
## Length:3 Min. :19.0 Min. :62.0
## Class :character 1st Qu.:19.5 1st Qu.:70.0
## Mode :character Median :20.0 Median :78.0
## Mean :20.0 Mean :75.0
## 3rd Qu.:20.5 3rd Qu.:81.5
## Max. :21.0 Max. :85.0
Missing values:
colSums(is.na(df))
## name age score
## 0 0 0
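The practice file has no missing values, so here is a small made-up vector showing why the check matters: a single `NA` makes `mean()` return `NA` unless you ask R to skip it.

```r
marks <- c(70, NA, 85, 60)

# Count the missing values.
n_missing <- sum(is.na(marks))
n_missing

# With an NA present, the plain mean is NA.
mean(marks)

# na.rm = TRUE drops missing values before computing the mean.
mean(marks, na.rm = TRUE)
```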
We will use a simple structure in this course:
Let’s do a small example.
set.seed(1)
demo <- tibble(
student_id = 1:12,
gender = sample(c("F", "M"), 12, replace = TRUE),
math = sample(c(NA, 40:95), 12, replace = TRUE),
english = sample(c(NA, 40:95), 12, replace = TRUE)
)
demo
colSums(is.na(demo))
## student_id gender math english
## 0 0 0 0
Here we fill missing marks using the subject mean (basic imputation).
demo_clean <- demo %>%
mutate(
math = ifelse(is.na(math), mean(math, na.rm = TRUE), math),
english = ifelse(is.na(english), mean(english, na.rm = TRUE), english),
average = (math + english) / 2
)
demo_clean
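The missing-value counts printed above are all zero, so the imputation step leaves this particular demo data unchanged. To see the same `ifelse()` pattern do something, here is a sketch on a made-up vector that does contain an `NA`:

```r
math <- c(50, NA, 70, 90)

# Mean of the observed values: (50 + 70 + 90) / 3 = 70
math_mean <- mean(math, na.rm = TRUE)

# Replace each NA with that mean; keep observed values as they are.
math_filled <- ifelse(is.na(math), math_mean, math)
math_filled
```

Mean imputation is simple, but it reduces the spread of the data, so treat it as a basic starting point.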
Mean by gender:
demo_clean %>%
group_by(gender) %>%
summarise(
n = n(),
mean_math = mean(math),
mean_english = mean(english),
mean_average = mean(average),
.groups = "drop"
)
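To check a grouped summary by hand, base R's `tapply()` computes the same kind of group mean. The small table below is made up so the result is easy to verify mentally.

```r
scores <- data.frame(
  gender = c("F", "M", "F", "M"),
  math   = c(80, 60, 90, 70)
)

# Apply mean() to math separately within each gender.
group_means <- tapply(scores$math, scores$gender, mean)
group_means
```

The F group averages 80 and 90, and the M group averages 60 and 70.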
ggplot(demo_clean, aes(x = math, y = english)) +
geom_point() +
labs(
title = "Math vs English (Demo Data)",
x = "Math score",
y = "English score"
)
A distribution plot:
ggplot(demo_clean, aes(x = average)) +
geom_histogram(bins = 8) +
labs(
title = "Distribution of Average Score",
x = "Average",
y = "Count"
)
If you want to save an R object to re-use later:
saveRDS(demo_clean, "demo_clean.rds")
To load it:
loaded <- readRDS("demo_clean.rds")
head(loaded)
readr::write_csv(demo_clean, "demo_clean.csv")
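The two formats behave differently when read back. The sketch below uses temporary files and base R functions (`write.csv()` and `read.csv()` rather than readr) so it runs without extra packages; the data values are made up.

```r
small <- data.frame(id = 1:3, score = c(78, 62, 85))

# RDS keeps the R object exactly as it was, including column types.
rds_path <- tempfile(fileext = ".rds")
saveRDS(small, rds_path)
restored <- readRDS(rds_path)
identical(small, restored)

# CSV is plain text, so column types are guessed again when reading.
csv_path <- tempfile(fileext = ".csv")
write.csv(small, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
str(from_csv)
```

Use RDS when you will continue working in R, and CSV when you need to share the data with other software.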
?mean
?read_csv
help.search("histogram")
str(demo_clean)
## tibble [12 × 5] (S3: tbl_df/tbl/data.frame)
## $ student_id: int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr [1:12] "F" "M" "F" "F" ...
## $ math : int [1:12] 71 59 59 80 92 84 48 45 47 53 ...
## $ english : int [1:12] 79 63 84 75 75 72 80 63 82 53 ...
## $ average : num [1:12] 75 61 71.5 77.5 83.5 78 64 54 64.5 53 ...
names(demo_clean)
## [1] "student_id" "gender" "math" "english" "average"
Error: there is no package called 'tidyverse'
Fix: install once, then load.
install.packages("tidyverse")
library(tidyverse)
Error: cannot open the connection
Fix: check working directory + file name.
getwd()
## [1] "C:/Users/uSer/OneDrive/Documents/SEMESTER 1.2"
list.files()
## [1] "00_Introduction.html"
## [2] "01_Foundations_Data_Analytics_R.Rmd"
## [3] "02_Data_Cleaning_Preparation_R.Rmd"
## [4] "03_Exploratory_Data_Analysis_R (2).Rmd"
## [5] "04_Introduction_2_Analytical_Models_R.Rmd"
## [6] "05_Model_Validation_Responsible_Analysis_R.Rmd"
## [7] "06_Applied_Project_R.Rmd"
## [8] "2025_A_KSD_1466_F.html"
## [9] "2025_A_KSD_1466_F.Rmd"
## [10] "2025_A_KSD_1466_F_files"
## [11] "airquality.csv"
## [12] "clean_student_grades.csv"
## [13] "demo_clean.csv"
## [14] "demo_clean.rds"
## [15] "dice_game_results.csv"
## [16] "dice_rolls.csv"
## [17] "dice_sum_counts.csv"
## [18] "DOC-20250406-WA0005.pdf"
## [19] "Macro economics - introductory.pdf"
## [20] "messy_data.csv"
## [21] "Naboth_Harris.html"
## [22] "processed_mtcars.csv"
## [23] "progam.csv"
## [24] "progam.rds"
## [25] "Riemann Integration.pdf"
## [26] "rsconnect"
## [27] "Share with CamScanner.zip"
## [28] "STA1204_APT_COURSE CONTENT APPLIED PROBABILITY.pdf"
## [29] "STA1205_COURSE CONTENT MATHEMATICAL STATISTICS.pdf"
## [30] "student.csv"
## [31] "students.csv"
## [32] "students.rds"
## [33] "students_clean.csv"
## [34] "The-Academic-Policy-and-Examination-Regulations-Kabale-University.pdf"
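A small guard before reading can turn this error into a clearer message. In this sketch `my_data.csv` is a hypothetical file name; replace it with your own.

```r
file_name <- "my_data.csv"  # hypothetical file name

if (file.exists(file_name)) {
  dat <- read.csv(file_name)
} else {
  # Show where R looked and which CSV files are actually there.
  message("Not found: ", file_name, " in ", getwd())
  print(list.files(pattern = "\\.csv$"))
  dat <- NULL
}
```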
In R, the decimal separator is a point: write 3.14, not 3,14.
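This matters most when numbers arrive as text, for example from a spreadsheet that uses commas. A minimal sketch:

```r
# Text with a decimal point converts cleanly.
as.numeric("3.14")

# Text with a decimal comma becomes NA (with a warning, hidden here).
bad <- suppressWarnings(as.numeric("3,14"))
bad

# One fix: swap the comma for a point before converting.
fixed <- as.numeric(sub(",", ".", "3,14", fixed = TRUE))
fixed
```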
Create a small data frame named profile with:
- your name
- your program
- your home district
- your favorite number

Then print it and show its structure.
# TODO: write your code here
profile <- data.frame(
name = c("AHABWE NABOTH KAKURU"),
program = c("STATISTICS AND DATA SCIENCE"),
home_district = c("KABALE"),
fav_number = 0040910635672174.74
)
profile
str(profile)
Create a vector of 10 numbers (any numbers), then compute:
- mean
- median
- standard deviation
- min and max
# TODO: write your code here
v1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
mean(v1)
## [1] 5.5
median(v1)
## [1] 5.5
sd(v1)
## [1] 3.02765
min(v1)
## [1] 1
max(v1)
## [1] 10
Create a data frame with two columns x and y (10 rows), then make a scatter plot.
# TODO: write your code here
dading <- data.frame(
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
y = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)
dading
plot(dading$x, dading$y,
main = "Y values against X values",
xlab = "X values",
ylab = "Y values",
pch = 21,
col = "red")
Read a CSV file with read_csv(), then inspect it with glimpse() and summary().
# TODO: write your code here
progam <- data.frame(
name = c("Naboth", "Nebart", "Lynn", "Ian", "Collins"),
scores = c(90, 80, 70, 60, 50),
grades = c("A", "B", "C", "D", "E")
)
progam
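# Write the data to CSV, then read it back with read_csv() as the task asks
readr::write_csv(progam, "progam.csv")
progam <- readr::read_csv("progam.csv", show_col_types = FALSE)
glimpse(progam)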
## Rows: 5
## Columns: 3
## $ name <chr> "Naboth", "Nebart", "Lynn", "Ian", "Collins"
## $ scores <dbl> 90, 80, 70, 60, 50
## $ grades <chr> "A", "B", "C", "D", "E"
summary(progam)
## name scores grades
## Length:5 Min. :50 Length:5
## Class :character 1st Qu.:60 Class :character
## Mode :character Median :70 Mode :character
## Mean :70
## 3rd Qu.:80
## Max. :90
If you can run this notebook and complete Tasks A–D, you are ready for SDS 1201.