Overview (Weeks 1–2)

This notebook introduces the foundations of Data Analytics with R.
It focuses on how data analytics supports decision-making, how to work with real datasets, and how to understand data structure before modeling.

By the end of this notebook, you should be comfortable loading datasets, inspecting their structure, summarizing key variables, and producing basic but meaningful plots.


1. Role of Data Analytics in Decision-Making

Data analytics helps convert raw data into evidence for decisions.

Examples:

  • Which district has the highest school dropout rate?
  • Are exam scores improving over time?
  • Which variables best explain loan default?

In practice, analytics follows a simple pipeline:

  1. Define the question
  2. Collect or obtain data
  3. Understand and clean the data
  4. Summarize and visualize
  5. Draw conclusions

R is especially strong in steps 3–5.


2. Types of Data and Analytical Questions

Common Data Types

  • Numeric: marks, income, rainfall
  • Categorical: gender, district, school type
  • Ordinal: grades (A, B, C), ratings (low–high)
  • Time-based: dates, years, months
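As a rough sketch, the four data types above map onto R classes roughly as follows. All values and district names here are made up for illustration.

```r
# Hypothetical values, for illustration only
income   <- c(250, 400, 310)                            # numeric
district <- factor(c("Kabale", "Mbarara", "Kabale"))    # categorical
grade    <- factor(c("B", "A", "C"),
                   levels = c("C", "B", "A"),
                   ordered = TRUE)                      # ordinal: C < B < A
exam_day <- as.Date(c("2024-03-01", "2024-03-15", "2024-04-02"))  # time-based

class(income)
class(district)
class(grade)
class(exam_day)

# Ordered factors can be compared; unordered factors cannot
grade[1] > grade[3]   # TRUE, because B ranks above C
```

Storing a variable with the right class matters later: summaries and plots treat factors, numbers, and dates differently.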

Types of Questions

  • Descriptive: What is happening?
  • Comparative: How do groups differ?
  • Relational: Are variables related?
  • Predictive: What is likely to happen next?

At this stage, we focus on descriptive and comparative questions.
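To make the distinction concrete, here is a minimal sketch using the small students data frame from this notebook. tapply() is only one of several ways to compute a grouped summary.

```r
# The students data frame used throughout this notebook
students <- data.frame(
  name = c("Ninsiima", "Katusiime", "Ndyowe", "Manzi", "Atwine"),
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

# Descriptive: what is the typical score?
overall_mean <- mean(students$score)
overall_mean

# Comparative: how do the groups differ?
group_means <- tapply(students$score, students$gender, mean)
group_means
```

With only five students, the group means are illustrative rather than conclusive.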


3. Revisiting R Essentials for Analytics

This course assumes you already know basic R syntax. We briefly revisit only what is essential for analytics.

Vectors

scores <- c(65, 15, -87, 502, 35.086, 72, 80, 55, 90)
mean(scores)
## [1] 91.89844
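The vector above contains two values (-87 and 502) that pull the mean upward. The sketch below shows indexing, element-wise arithmetic, and logical filtering on the same vector. The 0 to 100 range for valid scores is an assumption made for this example.

```r
scores <- c(65, 15, -87, 502, 35.086, 72, 80, 55, 90)

scores[2]          # second element
scores + 5         # arithmetic is applied to every element
median(scores)     # 65: much less affected by -87 and 502 than the mean

# Assumption: valid scores lie between 0 and 100
valid <- scores[scores >= 0 & scores <= 100]
length(valid)      # 7 values remain
mean(valid)
```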

Functions, Loops, and Conditionals

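# Sum the elements of n with a loop, then classify the total
mine <- function(n) {
  start <- 0
  for (i in n) {
    start <- start + i
  }
  if (start > 10) {
    cat("Not worth it")
  } else if (start < -5) {
    cat("exactly have what we need")
  } else {
    cat("Totally acceptable")
  }
}
mine(5:40)  # 5 + 6 + ... + 40 = 810, which is greater than 10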
## Not worth it
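score <- 40  # global variable; hey() reads it only to build the loop range

hey <- function(toolit) {
  # score:toolit is evaluated once, using the global score (40)
  for (i in score:toolit) {
    # Assigning inside the function creates a local score;
    # the global score stays 40
    score <- toolit * i
  }
  # The last iteration has i == toolit, so the local score ends as toolit^2
  if (score >= 50) {
    cat("Ohh my God!")
  } else {
    cat("Lemme tell you something")
  }
}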
hey(7)
## Lemme tell you something
hey(8)
## Ohh my God!
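The loop in mine() can also be written with the built-in sum(). The sketch below is one possible variant, not a replacement for the version above; the name classify_total is hypothetical, and it returns the message instead of printing it so the result can be stored or tested.

```r
# classify_total is a hypothetical name used only for this sketch
classify_total <- function(n) {
  total <- sum(n)  # vectorized: no explicit loop needed
  if (total > 10) {
    "Not worth it"
  } else if (total < -5) {
    "exactly have what we need"
  } else {
    "Totally acceptable"
  }
}

classify_total(5:40)
classify_total(c(-10, 2))
classify_total(c(1, 2, 3))
```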

Data Frames

students <- data.frame(
  name = c("Ninsiima", "Katusiime", "Ndyowe", "Manzi", "Atwine"),
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

students
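A few common ways to pull pieces out of a data frame are sketched below. The pass mark of 50 is an assumption made for the example.

```r
students <- data.frame(
  name = c("Ninsiima", "Katusiime", "Ndyowe", "Manzi", "Atwine"),
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

students$score        # one column, returned as a vector
students[2, ]         # second row, all columns

# Assumption: a score of 50 or more counts as a pass
passed <- students[students$score >= 50, ]
nrow(passed)

# Sort from highest to lowest score, keeping two columns
ranked <- students[order(-students$score), c("name", "score")]
ranked
```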

4. Importing Real Datasets

Most real analytics begins with external data files, often CSV.

Reading a CSV File

# Example: replace "students.csv" with the path to your own file,
# or set the working directory (setwd()) to the folder where the CSV is saved
df1 <- read.csv("students.csv")
df1

# For illustration, we reuse the students data frame
df <- students
df

Key function: read.csv() reads tabular data into R as a data frame.
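If you do not have a CSV file at hand, the following sketch creates one and reads it back. The temporary file path stands in for a real file location and is an assumption of this example.

```r
students <- data.frame(
  name = c("Ninsiima", "Katusiime", "Ndyowe", "Manzi", "Atwine"),
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

# Temporary file standing in for a real path such as "students.csv"
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing R's row numbers as an extra column
write.csv(students, csv_path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character
df_in <- read.csv(csv_path, stringsAsFactors = FALSE)
head(df_in, 3)
```

Note that read.csv() infers column types from the file contents, so whole-number columns come back as integer.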


5. Understanding Data Structure

Before analysis, always inspect the structure.

Structure of the Dataset

str(df)
## 'data.frame':    5 obs. of  3 variables:
##  $ name  : chr  "Ninsiima" "Katusiime" "Ndyowe" "Manzi" ...
##  $ gender: chr  "F" "M" "F" "M" ...
##  $ score : num  98 72 80 15 61

This tells you:

  • Number of rows and columns
  • Variable names
  • Data types of each variable

Column Names

names(df)
## [1] "name"   "gender" "score"

Dataset Dimensions

dim(df)
## [1] 5 3
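A few other base R functions are often used alongside str(), names(), and dim(). The sketch below is a suggestion, not a required checklist.

```r
df <- data.frame(
  name = c("Ninsiima", "Katusiime", "Ndyowe", "Manzi", "Atwine"),
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

head(df, 3)                       # first three rows
nrow(df)                          # number of rows
ncol(df)                          # number of columns

col_classes <- sapply(df, class)  # class of each column
col_classes

missing_per_col <- colSums(is.na(df))  # count of missing values per column
missing_per_col
```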

6. Summary Statistics

Summaries help you understand distributions quickly.

Overall Summary

summary(df)
##      name              gender              score     
##  Length:5           Length:5           Min.   :15.0  
##  Class :character   Class :character   1st Qu.:61.0  
##  Mode  :character   Mode  :character   Median :72.0  
##                                        Mean   :65.2  
##                                        3rd Qu.:80.0  
##                                        Max.   :98.0

For numeric variables, this gives:

  • Min and Max
  • 1st and 3rd quartiles
  • Median
  • Mean

For character columns such as name and gender, it reports only the length, class, and mode.

Focus on Numeric Columns

summary(df$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    61.0    72.0    65.2    80.0    98.0
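Each figure in summary() can also be computed on its own, and table() gives the counts that summary() does not show for character columns. A minimal sketch on the same data:

```r
df <- data.frame(
  name = c("Ninsiima", "Katusiime", "Ndyowe", "Manzi", "Atwine"),
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

mean(df$score)
median(df$score)
range(df$score)                        # minimum and maximum
sd(df$score)                           # standard deviation
quartiles <- quantile(df$score, c(0.25, 0.75))
quartiles

gender_counts <- table(df$gender)      # frequency of each category
gender_counts
```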

7. Basic Data Visualization

Visualization helps reveal patterns that tables cannot.

Histogram (Distribution of Scores)

hist(df$score,
     main = "Distribution of Student Scores",
     xlab = "Score")
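The shape of a histogram depends on the bins. The sketch below sets them explicitly with the breaks argument; the bin width of 20 and the 0 to 100 scale are assumptions made for this example.

```r
df <- data.frame(score = c(98, 72, 80, 15, 61))

# hist() draws the plot and also returns the bin counts
h <- hist(df$score,
          breaks = seq(0, 100, by = 20),   # bins: 0-20, 20-40, ..., 80-100
          main = "Distribution of Student Scores",
          xlab = "Score")

h$breaks   # bin edges
h$counts   # number of scores in each bin
```

Each bin includes its upper edge, so a score of 80 is counted in the 60 to 80 bin.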

Boxplot (Comparing Groups)

boxplot(score ~ gender,
        data = df,
        main = "Scores by Gender",
        xlab = "Gender",
        ylab = "Score")
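To see the numbers behind a boxplot, boxplot() can be called with plot = FALSE, which returns the statistics without drawing anything. A short sketch on the same data:

```r
df <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  score = c(98, 72, 80, 15, 61)
)

bp <- boxplot(score ~ gender, data = df, plot = FALSE)

bp$names        # group labels
bp$n            # observations per group
bp$stats        # one column per group: lower whisker, lower hinge,
                # median, upper hinge, upper whisker
bp$stats[3, ]   # medians only
```

With two or three observations per group, the boxes describe very little data, so read them with caution.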

Scatter Plot (Basic Relationship)

# With a single vector, plot() puts the position (index) on the x-axis
plot(df$score,
     main = "Score by Student Index",
     ylab = "Score",
     xlab = "Student Index")
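A scatter plot is most informative when it compares two numeric variables. The sketch below borrows two columns from the study data frame used in the practice tasks. cor() is added only as a rough numeric companion to the plot.

```r
study <- data.frame(
  students = c("marvel", "yuppie", "inno", "derrick", "suzan"),
  mtc_score = c(96, 78, 43, 77, 69),
  sci_score = c(48, 90, 78, 36, 89)
)

plot(study$mtc_score, study$sci_score,
     main = "MTC vs SCI Scores",
     xlab = "MTC score",
     ylab = "SCI score")

# Correlation ranges from -1 to 1; with five points it is only indicative
r <- cor(study$mtc_score, study$sci_score)
r
```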


8. Interpreting Results

When looking at summaries and plots, always ask:

Analytics is not just code — it is thinking with data.


9. Practice Tasks

  1. Load a CSV file of your choice and inspect it using str().
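study <- data.frame(
  students = c("marvel", "yuppie", "inno", "derrick", "suzan"),
  mtc_score = c(96, 78, 43, 77, 69),
  sci_score = c(48, 90, 78, 36, 89),
  sst_score = c(67, 89, 9, 86, 84),
  eng_score = c(78, 85, 59, 20, 14)
)
study
# row.names = FALSE keeps R's row numbers out of the file,
# so reading it back does not add an extra "X" column
write.csv(study, "study.csv", row.names = FALSE)
read.csv("study.csv")
str(study)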
## 'data.frame':    5 obs. of  5 variables:
##  $ students : chr  "marvel" "yuppie" "inno" "derrick" ...
##  $ mtc_score: num  96 78 43 77 69
##  $ sci_score: num  48 90 78 36 89
##  $ sst_score: num  67 89 9 86 84
##  $ eng_score: num  78 85 59 20 14
  2. Identify which variables are numeric and which are categorical.
summary(study)
##    students           mtc_score      sci_score      sst_score    eng_score   
##  Length:5           Min.   :43.0   Min.   :36.0   Min.   : 9   Min.   :14.0  
##  Class :character   1st Qu.:69.0   1st Qu.:48.0   1st Qu.:67   1st Qu.:20.0  
##  Mode  :character   Median :77.0   Median :78.0   Median :84   Median :59.0  
##                     Mean   :72.6   Mean   :68.2   Mean   :67   Mean   :51.2  
##                     3rd Qu.:78.0   3rd Qu.:89.0   3rd Qu.:86   3rd Qu.:78.0  
##                     Max.   :96.0   Max.   :90.0   Max.   :89   Max.   :85.0
str(study)
## 'data.frame':    5 obs. of  5 variables:
##  $ students : chr  "marvel" "yuppie" "inno" "derrick" ...
##  $ mtc_score: num  96 78 43 77 69
##  $ sci_score: num  48 90 78 36 89
##  $ sst_score: num  67 89 9 86 84
##  $ eng_score: num  78 85 59 20 14
  3. Produce a histogram for one numeric variable.
hist(study$mtc_score,
     main = "A histogram of MTC scores",
     xlab = "MTC scores",
     ylab = "frequency",
     col = "yellow")

  4. Produce a boxplot comparing one numeric variable across groups.

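# Base R
# Note: each student appears only once in study, so every "box" here
# collapses to a single line at that student's score. A real group
# comparison needs several observations per group.
boxplot(sci_score ~ students, data = study,
        main = "A boxplot of sci scores of students",
        xlab = "Students",
        ylab = "SCI scores")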

# ggplot2
# install.packages("ggplot2")  # run once if ggplot2 is not yet installed
library(ggplot2)
ggplot(data = study, aes(x = students, y = mtc_score)) +
  geom_boxplot() +
  labs(
    title = "A boxplot of mtc scores of students",
    x = "Students",
    y = "MTC scores",
    caption = "Source: own dataset"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

# hjust = 0.5 centers the title
# theme() adjusts individual elements; complete themes such as theme_minimal(), theme_bw(), and theme_classic() change the overall appearance
# geom_boxplot(fill = "red") fills the boxes with color
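Because study has one row per student, one way to get groups with several observations each is to treat the subject as the group. The sketch below reshapes the four score columns into a long format with base R's stack(); the column names score and subject are chosen for this example.

```r
study <- data.frame(
  students = c("marvel", "yuppie", "inno", "derrick", "suzan"),
  mtc_score = c(96, 78, 43, 77, 69),
  sci_score = c(48, 90, 78, 36, 89),
  sst_score = c(67, 89, 9, 86, 84),
  eng_score = c(78, 85, 59, 20, 14)
)

# stack() turns the score columns into two columns: values and ind
long <- stack(study[, c("mtc_score", "sci_score", "sst_score", "eng_score")])
names(long) <- c("score", "subject")   # names chosen for this sketch

# One box per subject, five students in each
boxplot(score ~ subject, data = long,
        main = "Scores by Subject",
        xlab = "Subject",
        ylab = "Score")

subject_medians <- tapply(long$score, long$subject, median)
subject_medians
```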
  5. Write two sentences interpreting each plot.
# A box plot shows how the distribution of a variable differs across groups, highlighting the median, spread, and any outliers.
# If one group has a higher median and a wider box, it has higher typical values and greater variability than the others.

Key Takeaways


End of Weeks 1–2 notebook.