Sino Yoga

Proposal Presentation

Erika Barajas

IBM 6400, Cal Poly Pomona

2026-04-03

Labeled Data

Prompt

In Step 1, you learned how to read SPSS data into RStudio and conduct EDA and hypothesis testing.

  • What is labeled data? Does your MSDM CEP involve labeled data? Explain why or why not.

  • What techniques are relevant for your MSDM CEP? Please elaborate on how these approaches might help you address your AO for CEP.

What is Labeled Data?

  • Labeled data refers to datasets in which variables and their values carry descriptive metadata (variable labels and value labels)
  • Common in SPSS, SAS, and Stata; for example, a variable named q1 carries the label “Overall Satisfaction”
  • Numeric values like 1–5 might be labeled “Strongly Disagree” to “Strongly Agree”
  • In R, the haven and labelled packages allow us to read and work with this type of data
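A minimal sketch of what a labeled variable looks like in R, assuming the haven and labelled packages are installed. The item q1 and its label come from the example above; the response values and the three middle value labels ("Disagree", "Neutral", "Agree") are made up for illustration.

```r
library(haven)     # labelled() constructor and as_factor()
library(labelled)  # var_label() and val_labels() accessors

# Hypothetical item q1: a 1-5 agreement scale with a variable label
q1 <- haven::labelled(
  c(5, 4, 2, 5, 3),
  labels = c(
    "Strongly Disagree" = 1,
    "Disagree"          = 2,   # assumed wording
    "Neutral"           = 3,   # assumed wording
    "Agree"             = 4,   # assumed wording
    "Strongly Agree"    = 5
  ),
  label = "Overall Satisfaction"
)

var_label(q1)    # variable label attached to q1
val_labels(q1)   # named vector mapping label text to numeric codes

# Convert to a factor whose levels are the value labels
q1_factor <- haven::as_factor(q1)
table(q1_factor)
```

The numeric codes stay available for analysis, while the labels travel with the data for tables and plots.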

Does Sino Yoga Involve Labeled Data?

My Dataset

  • Originally collected via Qualtrics and downloaded as CSV
  • The CSV export does not carry SPSS-style variable or value labels
  • However, after reviewing this module’s content, re-exporting the data from Qualtrics as a .sav file may be beneficial
  • This would enable full variable and value labeling for our multi-item Likert-scale measures
  • Our group will discuss whether to make this change for the CEP

Relevant Techniques

  • recode() and case_when() for recoding variables
  • set_variable_labels() from the labelled package
  • EDA to understand response distributions
  • Reliability analysis (Cronbach’s alpha) and EFA to validate scales
  • Composite indices by averaging scale items
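The techniques above might fit together roughly as follows. This is only a sketch: the item names (sat_1, sat_2, sat_3r) and responses are invented, not Sino Yoga data, and Cronbach's alpha is computed by hand from its textbook formula rather than with a dedicated package. EFA is left out to keep the example short.

```r
library(dplyr)
library(labelled)

# Hypothetical responses to a three-item, 5-point scale
survey <- tibble(
  sat_1  = c(5, 4, 2, 5, 3, 4),
  sat_2  = c(4, 4, 1, 5, 3, 5),
  sat_3r = c(1, 2, 5, 1, 3, 2)   # negatively worded item
)

survey <- survey %>%
  mutate(
    # Reverse-code the negatively worded item so that high = more satisfied
    sat_3 = case_when(
      sat_3r == 1 ~ 5,
      sat_3r == 2 ~ 4,
      sat_3r == 3 ~ 3,
      sat_3r == 4 ~ 2,
      sat_3r == 5 ~ 1
    )
  ) %>%
  set_variable_labels(sat_3 = "Satisfaction item 3 (reverse-coded)")

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_vars <- sapply(items, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

alpha <- cronbach_alpha(select(survey, sat_1, sat_2, sat_3))

# Composite index: mean of the scale items for each respondent
survey <- survey %>%
  mutate(sat_index = rowMeans(across(c(sat_1, sat_2, sat_3))))
```

Averaging items only makes sense once the reverse-coded items point in the same direction, which is why the recode comes first.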

Initial Data Cleaning

Prompt

In Step 2, you saw the initial data cleaning process. Describe it and what lessons you got from it.

Initial Data Cleaning Process

  • Read the SPSS .sav file into R using read_sav() from the haven package
  • Cleaned variable names automatically using clean_names() from the janitor package
  • Combined two waves of Prolific survey data into a single dataset
  • Filtered by initial criteria — removing system-generated responses such as spam, previews, and test entries
  • Cleaned by demographic criteria to ensure sample validity
  • Inspected data structure using dim() and view_df() (from the sjPlot package) to understand dimensions and variable labels
  • Saved the cleaned dataset back to SPSS format
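The steps above can be sketched end to end as below. Everything here is illustrative: the two tiny waves are made up and written to temporary .sav files so the example runs on its own, the column names (Status, Finished, Age) and the age cutoff are assumptions, and the status codes (0 = normal response, 1 = preview, 8 = spam) are assumed to follow the survey platform's convention rather than taken from the lecture.

```r
library(haven)    # read_sav(), write_sav()
library(dplyr)
library(janitor)  # clean_names()

# Two made-up waves standing in for the real survey exports
wave1_raw <- tibble(ResponseId = c("R_1", "R_2", "R_3"),
                    Status     = c(0, 1, 0),   # assumed: 1 = preview
                    Finished   = c(1, 1, 0),
                    Age        = c(34, 29, 41))
wave2_raw <- tibble(ResponseId = c("R_4", "R_5", "R_6"),
                    Status     = c(0, 8, 0),   # assumed: 8 = spam
                    Finished   = c(1, 1, 1),
                    Age        = c(17, 45, 52))

path1 <- tempfile(fileext = ".sav")
path2 <- tempfile(fileext = ".sav")
write_sav(wave1_raw, path1)
write_sav(wave2_raw, path2)

# 1. Read each .sav file and standardize the variable names
wave1 <- read_sav(path1) %>% clean_names()
wave2 <- read_sav(path2) %>% clean_names()

# 2. Combine the two waves, keeping track of where each row came from
combined <- bind_rows(wave1 = wave1, wave2 = wave2, .id = "wave")

# 3. Filter by initial criteria, then by a (hypothetical) demographic criterion
cleaned <- combined %>%
  filter(status == 0, finished == 1) %>%
  filter(age >= 18)

# 4. Inspect the result
dim(cleaned)
# sjPlot::view_df(cleaned)   # opens an HTML codebook; left commented out here

# 5. Save the cleaned data back to SPSS format
out_path <- tempfile(fileext = ".sav")
write_sav(cleaned, out_path)
```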

Lessons Learned

Raw survey data often contains system-generated junk responses that must be removed before any analysis.

Key Lesson 1: Always inspect your data immediately using dim() and view_df() — never assume it is clean.

Key Lesson 2: System-generated responses like spam, previews, and test entries are easy to overlook and can silently distort your results if not removed.

Key Lesson 3: Document every cleaning decision in Quarto for full transparency and reproducibility.

Further Data Cleaning

Prompt

In Step 3, you saw the further data cleaning process. Describe it and what the key takeaways are.

Further Data Cleaning Process

  • Created a custom freq_covid() function to generate frequency distribution tables efficiently
  • Used the function to inspect each variable’s distribution before and after recoding
  • Applied case_when() inside mutate() to recode variables and align value ranges
  • Renamed and relabeled variables for clarity and consistency
  • Conducted EDA throughout to verify recoding was applied correctly
  • Built composite scales after confirming psychometric validity
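A rough illustration of the inspect, recode, inspect loop described above. The helper freq_table() is a hypothetical stand-in for the lecture's freq_covid(), whose actual body is not shown in this deck; the variable practice_freq and the three-category cut points are also invented.

```r
library(dplyr)

# Hypothetical stand-in for freq_covid(): counts and percentages for one variable
freq_table <- function(data, var) {
  data %>%
    count({{ var }}, name = "n") %>%
    mutate(pct = round(100 * n / sum(n), 1))
}

# Made-up responses on a 1-5 frequency item
responses <- tibble(practice_freq = c(1, 2, 2, 3, 4, 5, 5, 5))

# Inspect the distribution before recoding
before <- freq_table(responses, practice_freq)

# Recode into three categories with case_when() inside mutate()
responses <- responses %>%
  mutate(practice_3cat = case_when(
    practice_freq <= 2 ~ "Low",
    practice_freq == 3 ~ "Medium",
    practice_freq >= 4 ~ "High"
  ))

# Inspect again to verify the recode: counts should still sum to the sample size
after <- freq_table(responses, practice_3cat)
```

Comparing before and after is the check that no value fell through the recoding logic; an unexpected NA row in after would signal a missed case.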

Key Takeaways

Takeaway 1

Recoding variables is a careful and deliberate process — each variable must be inspected individually to understand its distribution before any recoding logic is applied.

Takeaway 2

Writing custom functions like freq_covid() in R automates repetitive tasks, making the cleaning process more efficient, consistent, and reproducible across all variables.

Comparing the Two Approaches

Prompt

What are the big differences between the two approaches and the impact on the final sample size and response quality?

Side-by-Side Comparison

|                       | Initial Cleaning                    | Further Cleaning                   |
|-----------------------|-------------------------------------|------------------------------------|
| Level                 | Row level                           | Variable level                     |
| Focus                 | Completeness                        | Quality & consistency              |
| Actions               | Remove invalid/incomplete responses | Recode and relabel variable values |
| Impact on Sample Size | Largest reduction                   | Minimal row removal                |
| Impact on Quality     | Removes structurally invalid data   | Improves analytical validity       |

Summary

  • The initial cleaning operated at the row level — removing participants with system-generated or invalid responses
  • The further cleaning operated at the variable level — recoding values using case_when() to ensure consistency
  • Together they create a dataset that is both structurally sound and analytically valid
  • The trade-off: each filtering decision reduces sample size, which must be weighed against the benefit of higher response quality
  • The goal is a final dataset that supports accurate and trustworthy analytical results for the Sino Yoga CEP
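One way to make the sample-size trade-off visible is to record the row count after each filtering decision. The sketch below uses invented data and hypothetical criteria (status, finished, age); it is not the actual Sino Yoga filtering sequence.

```r
library(dplyr)

# Made-up raw responses
raw <- tibble(
  status   = c(0, 0, 1, 0, 8, 0, 0, 0),
  finished = c(1, 1, 1, 0, 1, 1, 1, 1),
  age      = c(25, 17, 30, 40, 22, 35, 51, 28)
)

# Apply each (hypothetical) criterion as its own step
step1 <- raw   %>% filter(status == 0)     # drop previews, spam, tests
step2 <- step1 %>% filter(finished == 1)   # drop incomplete responses
step3 <- step2 %>% filter(age >= 18)       # demographic criterion

# Attrition table: remaining n and rows dropped at each stage
attrition <- tibble(
  stage = c("Raw", "Valid status", "Finished", "Adults only"),
  n     = c(nrow(raw), nrow(step1), nrow(step2), nrow(step3))
) %>%
  mutate(dropped = lag(n, default = first(n)) - n)

attrition
```

A table like this documents how much each decision costs in sample size, which supports the transparency goal when the cleaning is reported.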

References

  • Jung, J. (2025). Transitioning from SPSS to R for Survey Research [Lecture video]. IBM 6400, Cal Poly Pomona.
  • Jung, J. (2025). Initial data cleaning [Lecture video]. IBM 6400, Cal Poly Pomona.
  • Jung, J. (2025). Further data cleaning [Lecture video]. IBM 6400, Cal Poly Pomona.