Data Science Workflow Example in Quarto
1 Introduction
1.1 Research Question
Does the number of study hours affect students’ exam scores?
1.2 Hypothesis
1.2.1 Null Hypothesis (H0)
Study hours have no effect on exam scores.
1.2.2 Alternative Hypothesis (H1)
Study hours positively affect exam scores.
2 Load Required Libraries
library(tidyverse)
3 Data Import
In a real research project, you would import an existing dataset.
For demonstration purposes, we first create a sample dataset and save it as a CSV file.
students_demo <- data.frame(
  student_id = 1:10,
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)
write.csv(
  students_demo,
  "students.csv",
  row.names = FALSE
)
Now import the dataset.
students <- read.csv("students.csv")
head(students)
  student_id study_hours exam_score
1          1           2         55
2          2           3         60
3          3           4         62
4          4           5         68
5          5           6         72
6          6           7         78
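read.csv() guesses each column's type from the file contents. With a real dataset it can help to declare the expected types so that malformed values are reported at import time. The following is a minimal sketch using readr (part of the tidyverse); the temporary file path and the object names demo and students_typed are placeholders introduced for this example only.

```r
library(readr)

# Recreate the demo data so this block runs on its own.
demo <- data.frame(
  student_id  = 1:10,
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score  = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)

# Hypothetical location: a temporary file instead of "students.csv".
csv_path <- tempfile(fileext = ".csv")
write_csv(demo, csv_path)

# Declare the expected type of every column up front.
students_typed <- read_csv(
  csv_path,
  col_types = cols(
    student_id  = col_integer(),
    study_hours = col_double(),
    exam_score  = col_double()
  )
)

# problems() returns one row per parsing failure; zero rows means a clean read.
n_problems <- nrow(problems(students_typed))
n_problems
```

If a value cannot be parsed as the declared type, readr records it in problems() rather than silently changing the column type.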
4 Data Cleaning (Tidying)
Inspect the dataset structure.
str(students)
'data.frame': 10 obs. of  3 variables:
 $ student_id : int  1 2 3 4 5 6 7 8 9 10
 $ study_hours: int  2 3 4 5 6 7 8 9 10 11
 $ exam_score : int  55 60 62 68 72 78 82 88 91 95
Check for missing values.
colSums(is.na(students))
 student_id study_hours  exam_score
          0           0           0
Drop any rows that contain missing values. The check above found none, so this step leaves the data unchanged here.
students <- students %>%
  drop_na()
Display summary statistics.
summary(students)
   student_id     study_hours      exam_score
 Min.   : 1.00   Min.   : 2.00   Min.   :55.0
 1st Qu.: 3.25   1st Qu.: 4.25   1st Qu.:63.5
 Median : 5.50   Median : 6.50   Median :75.0
 Mean   : 5.50   Mean   : 6.50   Mean   :75.1
 3rd Qu.: 7.75   3rd Qu.: 8.75   3rd Qu.:86.5
 Max.   :10.00   Max.   :11.00   Max.   :95.0
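Because the demo data has no missing values, the cleaning step cannot be seen doing anything. The sketch below applies the same two calls to a small, made-up data frame (students_na, with two values set to NA on purpose) to show how many rows drop_na() removes. The data and object names are illustrative assumptions, not part of the analysis.

```r
library(tidyr)

# Hypothetical data with two values deliberately set to NA.
students_na <- data.frame(
  student_id  = 1:5,
  study_hours = c(2, NA, 4, 5, 6),
  exam_score  = c(55, 60, NA, 68, 72)
)

# Count missing values per column, as in the check on the real data.
na_counts <- colSums(is.na(students_na))
na_counts

# Keep only rows in which every column is observed.
rows_before <- nrow(students_na)
students_complete <- drop_na(students_na)
rows_dropped <- rows_before - nrow(students_complete)
rows_dropped
```

Recording the number of dropped rows makes it easy to report how much data the cleaning step removed.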
5 Data Transformation
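Create a new variable, performance, that labels each student by exam result.
Exam scores are assumed to be out of 100: a score of 75 or higher is labelled "High", and anything below 75 is labelled "Low".
students <- students %>%
  mutate(
    performance = if_else(
      exam_score >= 75,
      "High",
      "Low"
    )
  )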
head(students)
  student_id study_hours exam_score performance
1          1           2         55         Low
2          2           3         60         Low
3          3           4         62         Low
4          4           5         68         Low
5          5           6         72         Low
6          6           7         78        High
Count students by performance category.
students %>%
  count(performance)
  performance n
1        High 5
2         Low 5
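if_else() handles a two-way split. If more than two categories are needed, dplyr's case_when() is one option. The sketch below is an illustration only: the cut-offs (85 and 65), the "Medium" label, and the names students_bands and band are assumptions introduced here, not part of the original analysis.

```r
library(dplyr)

# Recreate the demo data so this block runs on its own.
students_demo <- data.frame(
  student_id  = 1:10,
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score  = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)

students_bands <- students_demo %>%
  mutate(
    # Conditions are checked in order; the first match wins.
    band = case_when(
      exam_score >= 85 ~ "High",    # hypothetical cut-off
      exam_score >= 65 ~ "Medium",  # hypothetical cut-off
      TRUE             ~ "Low"
    ),
    # An ordered set of levels keeps tables and plots in a sensible order.
    band = factor(band, levels = c("Low", "Medium", "High"))
  )

band_counts <- count(students_bands, band)
band_counts
```

Converting the labels to a factor with explicit levels avoids the alphabetical ordering that character columns get by default.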
6 Data Visualization
6.1 Scatter Plot
Visualize the relationship between study hours and exam scores.
ggplot(
  students,
  aes(
    x = study_hours,
    y = exam_score
  )
) +
  geom_point(size = 3) +
  labs(
    title = "Study Hours vs Exam Scores",
    x = "Study Hours",
    y = "Exam Score"
  )
6.2 Scatter Plot with Regression Line
ggplot(
  students,
  aes(
    x = study_hours,
    y = exam_score
  )
) +
  geom_point(size = 3) +
  geom_smooth(
    method = "lm",
    se = TRUE
  ) +
  labs(
    title = "Linear Relationship Between Study Hours and Exam Scores",
    x = "Study Hours",
    y = "Exam Score"
  )
7 Modeling
7.1 Simple Linear Regression
Fit a simple linear regression model.
model <- lm(
  exam_score ~ study_hours,
  data = students
)
model

Call:
lm(formula = exam_score ~ study_hours, data = students)

Coefficients:
(Intercept)  study_hours
     45.358        4.576
7.2 Model Summary
summary(model)

Call:
lm(formula = exam_score ~ study_hours, data = students)

Residuals:
     Min       1Q   Median       3Q      Max
-1.66061 -0.57727 -0.03939  0.58182  1.46061

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  45.3576     0.7601   59.67 6.91e-12 ***
study_hours   4.5758     0.1070   42.78 9.83e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9716 on 8 degrees of freedom
Multiple R-squared:  0.9956, Adjusted R-squared:  0.9951
F-statistic:  1830 on 1 and 8 DF,  p-value: 9.832e-11
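The summary reports point estimates and standard errors. Two common follow-ups are a confidence interval for the slope and a prediction for a new observation. The sketch below uses base R's confint() and predict(); the new student with 7.5 study hours is a made-up value for illustration, and the data is recreated so the block runs on its own.

```r
# Recreate the demo data and the fitted model.
students <- data.frame(
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score  = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)
model <- lm(exam_score ~ study_hours, data = students)

# 95% confidence intervals for the intercept and the slope.
ci <- confint(model, level = 0.95)
ci

# Predicted score, with a 95% prediction interval, for a
# hypothetical student who studies 7.5 hours.
new_student <- data.frame(study_hours = 7.5)
pred <- predict(model, newdata = new_student, interval = "prediction")
pred
```

A prediction interval is wider than a confidence interval for the mean response because it also accounts for the scatter of individual scores around the fitted line.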
8 Interpretation of Results
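The coefficient for study_hours (4.576) is the expected change in exam score for each additional hour of study: in this sample, one more hour is associated with a score about 4.6 points higher.
The coefficient is positive, which matches the direction stated in H1. Because the dataset was created for demonstration, this shows an association in sample data rather than evidence of a causal effect.
The p-value for study_hours (9.83e-11) is far below the conventional 0.05 threshold, so H0 is rejected. The model also explains about 99.6% of the variance in exam scores (R-squared = 0.9956).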
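The p-value printed by summary() is two-sided, while H1 is directional (a positive effect). One way to obtain a one-sided p-value is to take the upper tail of the t distribution at the observed t statistic. The sketch below does this with base R and recreates the data and model so it runs on its own; the object names are introduced here for illustration.

```r
# Recreate the demo data and the fitted model.
students <- data.frame(
  study_hours = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  exam_score  = c(55, 60, 62, 68, 72, 78, 82, 88, 91, 95)
)
model <- lm(exam_score ~ study_hours, data = students)

# Pull the slope's t statistic and two-sided p-value from the summary table.
coefs       <- summary(model)$coefficients
t_value     <- coefs["study_hours", "t value"]
p_two_sided <- coefs["study_hours", "Pr(>|t|)"]

# One-sided test of H1: slope > 0, using the residual degrees of freedom.
p_one_sided <- pt(t_value, df = df.residual(model), lower.tail = FALSE)

c(two_sided = p_two_sided, one_sided = p_one_sided)
```

When the estimated slope is positive, the one-sided p-value is half the two-sided one, so the conclusion here does not change.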
9 Conclusion
This analysis demonstrates the complete data science workflow in Quarto:
- Research Design
- Data Import
- Data Cleaning
- Data Transformation
- Data Visualization
- Modeling
- Communication of Results
Quarto combines narrative text, R code, statistical analysis, and visualizations into a single reproducible document.
10 Communication (Rendering Output)
To generate the report:
- Save this file as data_science_workflow.qmd
- Open it in RStudio
- Click Render
- Quarto will generate an HTML report
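Rendering can also be scripted from R instead of clicking Render. The following is a minimal sketch, assuming the quarto R package and the Quarto command-line tool are installed and that the file is in the current working directory under the name used in the steps above.

```r
# File name taken from the steps above; adjust the path if it lives elsewhere.
qmd_file <- "data_science_workflow.qmd"

# Render only when both the file and the quarto package are available,
# so the snippet is safe to run in any working directory.
can_render <- file.exists(qmd_file) &&
  requireNamespace("quarto", quietly = TRUE)

if (can_render) {
  quarto::quarto_render(input = qmd_file, output_format = "html")
} else {
  message("Skipping render: file or quarto package not found.")
}
```

Scripting the render step is useful when the report is rebuilt automatically, for example after the data file changes.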
The rendered report contains:
- Research question
- Hypothesis
- Tables
- Visualizations
- Regression results
- Conclusions