Sino Yoga

Proposal Presentation

Erika Barajas

IBM 6400, Cal Poly Pomona

2026-04-03

Labeled Data

Prompt

In Step 1, you learned how to read SPSS data into RStudio and conduct EDA and hypothesis testing.

What is labeled data? Does your MSDM CEP involve labeled data? Explain why or why not.
What techniques are relevant for your MSDM CEP? Please elaborate on how these approaches might help you address your AO for CEP.

What is Labeled Data?

Labeled data refers to datasets where variables and values carry descriptive metadata
Common in SPSS, SAS, and Stata: for example, a variable named q1 carries the label “Overall Satisfaction”
Numeric values like 1–5 might be labeled “Strongly Disagree” to “Strongly Agree”
In R, the haven and labelled packages allow us to read and work with this type of data
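As a minimal sketch of what a labeled variable looks like in R, the snippet below builds one by hand with haven's labelled() constructor rather than reading it from a file. The q1 name and the “Overall Satisfaction” label come from the example above; the response values and the three middle value labels are assumptions for illustration.

```r
library(haven)

# A 5-point Likert item with a variable label and value labels.
# Response values and the middle value labels are made up for illustration.
q1 <- labelled(
  c(5, 3, 1, 4),
  labels = c(
    "Strongly Disagree" = 1,
    "Disagree"          = 2,
    "Neutral"           = 3,
    "Agree"             = 4,
    "Strongly Agree"    = 5
  ),
  label = "Overall Satisfaction"
)

attr(q1, "label")    # the variable label
attr(q1, "labels")   # the value labels (named numeric vector)

# Convert the numeric codes to a factor that uses the value labels as levels
q1_factor <- as_factor(q1)
q1_factor
```

The numeric codes stay available for analysis, while the labels travel with the variable as metadata.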

Does Sino Yoga Involve Labeled Data?

My Dataset

Originally collected via Qualtrics and downloaded as CSV
The CSV export does not carry SPSS-style variable or value labels
However, after reviewing this module’s content, re-exporting the data from Qualtrics as a .sav file may be beneficial
This would enable full variable and value labeling for our multi-item Likert-scale measures
Our group will discuss whether to make this change for the CEP
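If the data stays in CSV form, one possible alternative to re-exporting is to attach labels by hand with the labelled package. The sketch below assumes a toy data frame; the column names, the q2 label, and the response values are hypothetical and do not come from the Sino Yoga dataset.

```r
library(labelled)

# Toy stand-in for data read from a CSV file (hypothetical columns and values)
csv_data <- data.frame(
  q1 = c(5, 3, 1, 4),
  q2 = c(4, 4, 2, 5)
)

# Attach variable labels
csv_data <- set_variable_labels(
  csv_data,
  q1 = "Overall Satisfaction",
  q2 = "Likelihood to Recommend"   # hypothetical label
)

# Attach value labels to the endpoints of the q1 scale
csv_data <- set_value_labels(
  csv_data,
  q1 = c("Strongly Disagree" = 1, "Strongly Agree" = 5)
)

var_label(csv_data$q1)
val_labels(csv_data$q1)
```

This gives the CSV-derived data the same kind of metadata a .sav file would provide, at the cost of typing the labels manually.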

Relevant Techniques

recode() and case_when() for recoding variables
set_variable_labels() from the labelled package
EDA to understand response distributions
Reliability analysis (Cronbach’s alpha) and EFA to validate scales
Composite indices by averaging scale items
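The last two techniques can be sketched in base R. The example below computes Cronbach's alpha from its textbook formula, k / (k - 1) * (1 - sum of item variances / variance of the total score), and then averages the items into a composite index. The item names and responses are invented for illustration; in practice a function such as psych::alpha() would report this along with item-level diagnostics.

```r
# Three hypothetical Likert items measuring the same construct
items <- data.frame(
  sat1 = c(4, 5, 3, 2, 4, 5),
  sat2 = c(4, 4, 3, 2, 5, 5),
  sat3 = c(5, 5, 2, 1, 4, 4)
)

# Cronbach's alpha from item variances and the variance of the summed score
cronbach_alpha <- function(df) {
  k         <- ncol(df)
  item_var  <- sum(apply(df, 2, var))
  total_var <- var(rowSums(df))
  (k / (k - 1)) * (1 - item_var / total_var)
}

alpha_sat <- cronbach_alpha(items)
alpha_sat

# Composite index: the mean of the scale items for each respondent
items$sat_index <- rowMeans(items[, c("sat1", "sat2", "sat3")])
items
```

Averaging rather than summing keeps the composite on the original 1 to 5 response scale, which makes it easier to interpret.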

Initial Data Cleaning

Prompt

In Step 2, you saw the initial data cleaning process. Describe it and what lessons you got from it.

Initial Data Cleaning Process

Read the SPSS .sav file into R using read_sav() from the haven package
Standardized variable names automatically using clean_names() from the janitor package
Combined two waves of Prolific survey data into a single dataset
Filtered by initial criteria — removing system-generated responses such as spam, previews, and test entries
Cleaned by demographic criteria to ensure sample validity
Inspected data structure using dim() and view_df() (from the sjPlot package) to understand dimensions and variable labels
Saved the cleaned dataset back to SPSS format
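The combining and filtering steps above can be sketched with dplyr on toy data. This is an illustration under stated assumptions, not the pipeline from the lecture: the column names, the status codes, and the age cutoff are all hypothetical, and the two data frames stand in for waves that would normally be read with read_sav().

```r
library(dplyr)

# Two toy survey waves (hypothetical columns, codes, and values)
wave1 <- data.frame(
  response_id = c("R1", "R2", "R3", "R4"),
  status      = c("valid", "preview", "valid", "spam"),
  age         = c(25, 31, 17, 40)
)
wave2 <- data.frame(
  response_id = c("R5", "R6", "R7"),
  status      = c("valid", "test", "valid"),
  age         = c(52, 29, 36)
)

# Combine the waves and record which wave each row came from
combined <- bind_rows(wave1 = wave1, wave2 = wave2, .id = "wave")

# Initial criteria: drop system-generated responses (spam, previews, tests)
step1 <- combined %>% filter(status == "valid")

# Demographic criteria: keep adults only (an assumed rule for illustration)
cleaned <- step1 %>% filter(age >= 18)

# Track how many rows survive each step
attrition <- c(
  raw               = nrow(combined),
  after_system      = nrow(step1),
  after_demographic = nrow(cleaned)
)
attrition
dim(cleaned)
```

Keeping a running row count like attrition makes it easy to report how much each filtering decision cost in sample size.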

Lessons Learned

Raw survey data often contains system-generated responses (spam, previews, test entries) that must be removed before any analysis.

Key Lesson 1: Always inspect your data immediately using dim() and view_df() — never assume it is clean.

Key Lesson 2: System-generated responses like spam, previews, and test entries are easy to overlook and can distort your results if not removed.

Key Lesson 3: Document every cleaning decision in Quarto for full transparency and reproducibility.

Further Data Cleaning

Prompt

In Step 3, you saw the further data cleaning process. Describe it and what the key takeaways are.

Further Data Cleaning Process

Created a custom freq_covid() function to generate frequency distribution tables efficiently
Used the function to inspect each variable’s distribution before and after recoding
Applied case_when() inside mutate() to recode variables and align value ranges
Renamed and relabeled variables for clarity and consistency
Conducted EDA throughout to verify recoding was applied correctly
Built composite scales after confirming psychometric validity
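The inspect, recode, and re-inspect loop described above might look like the sketch below. The freq_table() helper is a hypothetical stand-in written in the spirit of freq_covid(), whose actual definition is not shown in this deck; the q1 data, the stray 99 code, and the three-category recoding scheme are likewise assumptions.

```r
library(dplyr)

# Hypothetical helper: counts and percentages for one variable
freq_table <- function(data, var) {
  data %>%
    count({{ var }}, name = "n") %>%
    mutate(pct = round(100 * n / sum(n), 1))
}

# Toy 5-point item with one out-of-range code (99)
survey <- data.frame(q1 = c(1, 2, 2, 3, 4, 5, 5, 5, 99))

# Inspect the distribution before recoding
freq_table(survey, q1)

# Collapse to three categories and set unexpected codes to missing
survey <- survey %>%
  mutate(
    q1_3 = case_when(
      q1 %in% 1:2 ~ 1,        # disagree
      q1 == 3     ~ 2,        # neutral
      q1 %in% 4:5 ~ 3,        # agree
      TRUE        ~ NA_real_  # anything else, such as 99
    )
  )

# Inspect again to confirm the recode behaved as intended
freq_table(survey, q1_3)
```

Running the same frequency helper before and after the recode is what catches mistakes such as an unhandled code or a swapped category.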

Key Takeaways

Takeaway 1

Recoding variables is a careful and deliberate process — each variable must be inspected individually to understand its distribution before any recoding logic is applied.

Takeaway 2

Writing custom functions like freq_covid() in R automates repetitive tasks, making the cleaning process more efficient, consistent, and reproducible across all variables.

Comparing the Two Approaches

Prompt

What are the big differences between the two approaches and the impact on the final sample size and response quality?

Side-by-Side Comparison

Dimension	Initial Cleaning	Further Cleaning
Level	Row level	Variable level
Focus	Completeness	Quality & consistency
Actions	Remove invalid/incomplete responses	Recode and relabel variable values
Impact on Sample Size	Largest reduction	Minimal row removal
Impact on Quality	Removes structurally invalid data	Improves analytical validity

Summary

The initial cleaning operated at the row level — removing participants with system-generated or invalid responses
The further cleaning operated at the variable level — recoding values using case_when() to ensure consistency
Together they create a dataset that is both structurally sound and analytically valid
The trade-off: each filtering decision reduces sample size, which must be weighed against the benefit of higher response quality
The goal is a final dataset that supports accurate and trustworthy analytical results for the Sino Yoga CEP

References

Jung, J. (2025). Transitioning from SPSS to R for Survey Research [Lecture video]. IBM 6400, Cal Poly Pomona.
Jung, J. (2025). Initial data cleaning [Lecture video]. IBM 6400, Cal Poly Pomona.
Jung, J. (2025). Further data cleaning [Lecture video]. IBM 6400, Cal Poly Pomona.