Data Science Workflow Example in Quarto

Author

Your Name

Published

June 25, 2026

1 Introduction

1.1 Research Question

Does the number of study hours affect students’ exam scores?

1.2 Hypothesis

1.2.1 Null Hypothesis (H0)

Study hours have no effect on exam scores.

1.2.2 Alternative Hypothesis (H1)

Study hours positively affect exam scores.
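
One way to state these hypotheses formally, assuming the simple linear regression model that is fitted later in this document. The symbols for the intercept, slope, and error term are notation introduced here for illustration:

```latex
\begin{align}
\text{exam\_score}_i &= \beta_0 + \beta_1 \, \text{study\_hours}_i + \varepsilon_i \\
H_0 &: \beta_1 = 0 \\
H_1 &: \beta_1 > 0
\end{align}
```

Under this framing the alternative is one-sided, whereas the p-value that summary() prints for a coefficient is two-sided.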

2 Load Required Libraries

library(tidyverse)

3 Data Import

In a real research project, you would import an existing dataset.

For demonstration purposes, we first create a sample dataset and save it as a CSV file.

students_demo <- data.frame(
  student_id = 1:10,
  study_hours = c(2,3,4,5,6,7,8,9,10,11),
  exam_score = c(55,60,62,68,72,78,82,88,91,95)
)

write.csv(
  students_demo,
  "students.csv",
  row.names = FALSE
)

Now import the dataset.

students <- read.csv("students.csv")

head(students)
  student_id study_hours exam_score
1          1           2         55
2          2           3         60
3          3           4         62
4          4           5         68
5          5           6         72
6          6           7         78
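
The import above uses base R's read.csv(). Since the tidyverse is already loaded, readr's read_csv() is an alternative that returns a tibble and lets the column types be declared up front. A minimal sketch, assuming the same three-column layout; the temporary file path and the name students_tbl are illustrative only:

```r
library(readr)

# Recreate the demo data so this block runs on its own.
students_demo <- data.frame(
  student_id = 1:10,
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)

# Write to a temporary file (illustrative path) instead of the working directory.
csv_path <- tempfile(fileext = ".csv")
write_csv(students_demo, csv_path)

# Declare the expected column types so that type mismatches
# surface as parsing problems at import time.
students_tbl <- read_csv(
  csv_path,
  col_types = cols(
    student_id = col_integer(),
    study_hours = col_double(),
    exam_score = col_double()
  )
)

students_tbl
```

Declaring types explicitly is optional for a file this small, but it can catch surprises such as a numeric column that was exported as text.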

4 Data Cleaning (Tidying)

Inspect the dataset structure.

str(students)
'data.frame':   10 obs. of  3 variables:
 $ student_id : int  1 2 3 4 5 6 7 8 9 10
 $ study_hours: int  2 3 4 5 6 7 8 9 10 11
 $ exam_score : int  55 60 62 68 72 78 82 88 91 95

Check for missing values.

colSums(is.na(students))
 student_id study_hours  exam_score 
          0           0           0 

Remove rows with missing values, if any. The check above found none, so this step leaves the data unchanged.

students <- students %>%
  drop_na()
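
Because the demo data are complete, drop_na() has nothing to remove here. The sketch below shows its effect on a small, made-up data frame (students_na is hypothetical and not part of the analysis):

```r
library(tidyr)

# Hypothetical data with two incomplete rows, for illustration only.
students_na <- data.frame(
  student_id = 1:5,
  study_hours = c(2, NA, 4, 5, 6),
  exam_score = c(55, 60, NA, 68, 72)
)

# Count missing values per column before cleaning.
colSums(is.na(students_na))

# drop_na() removes every row that contains at least one NA.
students_clean <- drop_na(students_na)

nrow(students_na)     # 5 rows before
nrow(students_clean)  # 3 rows after
```

Dropping rows is the simplest option, but it discards the non-missing values in those rows as well, so it is worth checking how many rows are lost.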

Display summary statistics.

summary(students)
   student_id     study_hours      exam_score  
 Min.   : 1.00   Min.   : 2.00   Min.   :55.0  
 1st Qu.: 3.25   1st Qu.: 4.25   1st Qu.:63.5  
 Median : 5.50   Median : 6.50   Median :75.0  
 Mean   : 5.50   Mean   : 6.50   Mean   :75.1  
 3rd Qu.: 7.75   3rd Qu.: 8.75   3rd Qu.:86.5  
 Max.   :10.00   Max.   :11.00   Max.   :95.0  

5 Data Transformation

Create a new variable.

Suppose exam scores are out of 100. We label scores of 75 or higher as "High" and all other scores as "Low".

students <- students %>%
  mutate(
    performance = if_else(
      exam_score >= 75,
      "High",
      "Low"
    )
  )

head(students)
  student_id study_hours exam_score performance
1          1           2         55         Low
2          2           3         60         Low
3          3           4         62         Low
4          4           5         68         Low
5          5           6         72         Low
6          6           7         78        High

Count students by performance category.

students %>%
  count(performance)
  performance n
1        High 5
2         Low 5
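
Once the category exists, it can also be used to compare the two groups. A minimal sketch, assuming the same data and the same 75-point threshold; the name performance_summary and its column names are illustrative:

```r
library(dplyr)

# Recreate the data so this block runs on its own.
students <- data.frame(
  student_id = 1:10,
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)

performance_summary <- students %>%
  mutate(performance = if_else(exam_score >= 75, "High", "Low")) %>%
  group_by(performance) %>%
  summarise(
    n = n(),                         # students per category
    mean_hours = mean(study_hours),  # average study hours per category
    mean_score = mean(exam_score)    # average exam score per category
  )

performance_summary
```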

6 Data Visualization

6.1 Scatter Plot

Visualize the relationship between study hours and exam scores.

ggplot(
  students,
  aes(
    x = study_hours,
    y = exam_score
  )
) +
  geom_point(size = 3) +
  labs(
    title = "Study Hours vs Exam Scores",
    x = "Study Hours",
    y = "Exam Score"
  )

6.2 Scatter Plot with Regression Line

ggplot(
  students,
  aes(
    x = study_hours,
    y = exam_score
  )
) +
  geom_point(size = 3) +
  geom_smooth(
    method = "lm",
    se = TRUE
  ) +
  labs(
    title = "Linear Relationship Between Study Hours and Exam Scores",
    x = "Study Hours",
    y = "Exam Score"
  )
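
The shaded band drawn by geom_smooth(method = "lm", se = TRUE) is a confidence interval for the fitted mean (95% by default). As a rough illustration of what that band represents, the sketch below computes the interval directly with predict() and draws it with geom_ribbon(). The names band and band_plot are illustrative, and the result should closely match the plot above rather than reproduce it pixel for pixel:

```r
library(ggplot2)

# Recreate the data so this block runs on its own.
students <- data.frame(
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)

# Fit the same kind of linear model that geom_smooth(method = "lm") uses.
model <- lm(exam_score ~ study_hours, data = students)

# 95% confidence interval for the fitted mean at each observed x.
# predict() returns a matrix with columns fit, lwr, and upr.
band <- cbind(
  students,
  predict(model, newdata = students, interval = "confidence", level = 0.95)
)

band_plot <- ggplot(band, aes(x = study_hours)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) +
  geom_line(aes(y = fit)) +
  geom_point(aes(y = exam_score), size = 3) +
  labs(
    title = "Fitted Line with 95% Confidence Band",
    x = "Study Hours",
    y = "Exam Score"
  )

band_plot
```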

7 Modeling

7.1 Simple Linear Regression

Fit a simple linear regression model.

model <- lm(
  exam_score ~ study_hours,
  data = students
)

model

Call:
lm(formula = exam_score ~ study_hours, data = students)

Coefficients:
(Intercept)  study_hours  
     45.358        4.576  
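
The two coefficients define the fitted line, roughly exam_score = 45.358 + 4.576 × study_hours. A short sketch of how that line can be used for prediction with predict(); the new_students values are hypothetical and were chosen to stay inside the observed range of 2 to 11 hours:

```r
# Recreate the data and model so this block runs on its own.
students <- data.frame(
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)
model <- lm(exam_score ~ study_hours, data = students)

# Hypothetical new students, for illustration only.
new_students <- data.frame(study_hours = c(4.5, 6.5, 9.5))

# Predicted exam scores from the fitted line.
predicted <- predict(model, newdata = new_students)

cbind(new_students, predicted_score = predicted)
```

Predictions outside the observed range would be extrapolations and should be treated with more caution.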

7.2 Model Summary

summary(model)

Call:
lm(formula = exam_score ~ study_hours, data = students)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.66061 -0.57727 -0.03939  0.58182  1.46061 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.3576     0.7601   59.67 6.91e-12 ***
study_hours   4.5758     0.1070   42.78 9.83e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9716 on 8 degrees of freedom
Multiple R-squared:  0.9956,    Adjusted R-squared:  0.9951 
F-statistic:  1830 on 1 and 8 DF,  p-value: 9.832e-11
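
The printed summary is convenient to read, but individual numbers are easier to reuse when extracted programmatically. A minimal sketch using base R accessors; the variable names are illustrative:

```r
# Recreate the data and model so this block runs on its own.
students <- data.frame(
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)
model <- lm(exam_score ~ study_hours, data = students)

model_summary <- summary(model)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(model_summary)
coef_table

# 95% confidence interval for the slope.
slope_ci <- confint(model, "study_hours", level = 0.95)
slope_ci

# Proportion of variance in exam_score explained by the model.
r_squared <- model_summary$r.squared
r_squared
```

A confidence interval for the slope conveys the uncertainty of the estimate more directly than the p-value alone.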

8 Interpretation of Results

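The coefficient for study_hours is 4.576: each additional hour of study is associated with an expected increase of about 4.6 points in exam score.

The coefficient is positive, so more study time goes with higher exam scores in this sample. The model accounts for about 99.6% of the variance in exam scores (R-squared = 0.9956).

The p-value for study_hours (9.83e-11) is far below the conventional 0.05 threshold, so we reject the null hypothesis. Because the dataset was constructed for demonstration, the near-perfect fit illustrates the workflow rather than a real finding.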
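
The p-value reported by summary() is two-sided, while the alternative hypothesis in this document is directional (a positive effect). A sketch of how a one-sided p-value could be derived from the same t statistic, assuming the usual t test for a regression slope; the variable names are illustrative:

```r
# Recreate the data and model so this block runs on its own.
students <- data.frame(
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)
model <- lm(exam_score ~ study_hours, data = students)

# Row of the coefficient table for the slope.
slope_row <- coef(summary(model))["study_hours", ]

t_value <- slope_row["t value"]
p_two_sided <- slope_row["Pr(>|t|)"]

# One-sided test of H1: slope > 0, using the upper tail of the
# t distribution with the model's residual degrees of freedom.
p_one_sided <- pt(t_value, df = df.residual(model), lower.tail = FALSE)

c(two_sided = unname(p_two_sided), one_sided = unname(p_one_sided))
```

When the estimated slope is positive, as it is here, the one-sided p-value is half the two-sided value, so the conclusion does not change.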

9 Conclusion

This analysis demonstrates the complete data science workflow in Quarto:

  1. Research Design
  2. Data Import
  3. Data Cleaning
  4. Data Transformation
  5. Data Visualization
  6. Modeling
  7. Communication of Results

Quarto combines narrative text, R code, statistical analysis, and visualizations into a single reproducible document.

10 Communication (Rendering Output)

To generate the report:

  1. Save this file as data_science_workflow.qmd
  2. Open it in RStudio
  3. Click Render
  4. Quarto will generate an HTML report
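
The steps above use the Render button in RStudio. As an alternative, the document can be rendered from the R console. The sketch below assumes the quarto R package and the Quarto command-line tool are both installed; render_report() is a hypothetical helper written for this example, not part of any package:

```r
# Hypothetical helper: render a .qmd file to HTML from the R console.
# Returns TRUE after rendering, or FALSE if a prerequisite is missing.
render_report <- function(path = "data_science_workflow.qmd") {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(FALSE)
  }
  if (!requireNamespace("quarto", quietly = TRUE)) {
    message("The 'quarto' R package is not installed.")
    return(FALSE)
  }
  # quarto_render() calls the Quarto command-line tool on the file.
  quarto::quarto_render(input = path, output_format = "html")
  TRUE
}

render_report()
```

Rendering from code can be handy when the report needs to be regenerated as part of a script.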

The rendered report contains:

  • Research question
  • Hypothesis
  • Tables
  • Visualizations
  • Regression results
  • Conclusions