Proposal Presentation
IBM 6400, Cal Poly Pomona
2026-04-03
In Step 1, you learned how to read SPSS data into RStudio and conduct EDA and hypothesis testing.
What is labeled data? Does your MSDM CEP involve labeled data? Explain why or why not.
What techniques are relevant for your MSDM CEP? Please elaborate on how these approaches might help you address your AO for CEP.
q1 carries the label “Overall Satisfaction”
haven and labelled packages allow us to read and work with this type of data
My Dataset
.sav file may be beneficial
Relevant Techniques
recode() and case_when() for recoding variables
set_variable_labels() from the labelled package
In Step 2, you saw the initial data cleaning process. Describe it and what lessons you got from it.
Read the .sav file into R using read_sav() from the haven package
clean_names()
dim() and view_df() to understand dimensions and variable labels
Raw survey data always contains system-generated junk responses that must be removed before any analysis.
Key Lesson 1: Always inspect your data immediately using dim() and view_df() — never assume it is clean.
Key Lesson 2: System-generated responses like spam, previews, and test entries are invisible problems that silently distort your results if not removed.
Key Lesson 3: Document every cleaning decision in Quarto for full transparency and reproducibility.
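
The Step 2 workflow above might be sketched as follows. This is a minimal illustration, not the actual cleaning script: the toy data frame, the column names Status and Finished, and their codings are assumptions made up for the example, and the file is written to a temporary path so the code is self-contained.

```r
library(haven)    # read_sav(), write_sav()
library(janitor)  # clean_names()
library(dplyr)    # filter()

# Hypothetical stand-in for a raw survey export; the real file and its
# columns are not shown in these slides.
raw <- data.frame(
  Status   = c(0, 1, 0, 8),   # assumed coding: 0 = genuine response
  Finished = c(1, 1, 0, 1),   # assumed coding: 1 = survey completed
  Q1       = c(5, 4, NA, 3)
)
attr(raw$Q1, "label") <- "Overall Satisfaction"  # variable label, as for q1

# Write a temporary .sav file so that read_sav() has something to read
path <- tempfile(fileext = ".sav")
write_sav(raw, path)

# Read the SPSS file and standardise the column names
survey <- read_sav(path) |> clean_names()

# Inspect immediately: dimensions and the variable label on q1
dim(survey)
attr(survey$q1, "label")
# In an interactive session, sjPlot::view_df(survey) lists all labels.

# Drop system-generated and incomplete rows (assumed filter rules)
cleaned <- survey |> filter(status == 0, finished == 1)
dim(cleaned)
```

Comparing dim(survey) with dim(cleaned) shows how many rows the row-level filter removed, which is worth recording in the Quarto document alongside the filter rule itself.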
In Step 3, you saw the further data cleaning process. Describe it and what the key takeaways are.
freq_covid() function to generate frequency distribution tables efficiently
case_when() inside mutate() to recode variables and align value ranges
Takeaway 1
Recoding variables is a careful and deliberate process — each variable must be inspected individually to understand its distribution before any recoding logic is applied.
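
As a rough sketch of this inspect-then-recode pattern, the example below collapses a hypothetical 7-point item onto a 5-point range with case_when() inside mutate(). The variable q2, its values, and the mapping are assumptions chosen only for illustration.

```r
library(dplyr)

# Hypothetical item measured on a 1-7 scale that should align with 1-5 items
df <- tibble(q2 = c(1, 2, 4, 6, 7, NA))

# Step 1: inspect the distribution before writing any recoding logic
table(df$q2, useNA = "ifany")

# Step 2: recode with explicit rules; unmatched or missing values become NA
df <- df |>
  mutate(q2_r = case_when(
    q2 <= 2 ~ 1,
    q2 == 3 ~ 2,
    q2 == 4 ~ 3,
    q2 == 5 ~ 4,
    q2 >= 6 ~ 5,
    TRUE    ~ NA_real_
  ))

# Step 3: cross-check old against new values to confirm the mapping
count(df, q2, q2_r)
```

Keeping the original column and writing the result to a new one (q2_r here) makes it possible to verify the mapping before the old variable is dropped.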
Takeaway 2
Writing custom functions like freq_covid() in R automates repetitive tasks, making the cleaning process more efficient, consistent, and reproducible across all variables.
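
The definition of freq_covid() is not shown in these slides, so the helper below is a hypothetical stand-in named freq_table() that illustrates the same idea: wrap the repeated count-and-percentage steps in one function and reuse it for every variable. The toy data are made up.

```r
library(dplyr)

# Hypothetical helper; a stand-in, not the original freq_covid()
freq_table <- function(data, var) {
  data |>
    count({{ var }}, name = "n") |>                # frequency per value
    mutate(percent = round(100 * n / sum(n), 1))   # share of all rows
}

# Toy survey data for illustration only
survey <- tibble(
  q1 = c(5, 4, 4, 3, 5, 5),
  q2 = c(1, 1, 2, NA, 2, 2)
)

# The same call pattern works for any column
freq_table(survey, q1)
freq_table(survey, q2)
```

Because missing values appear as their own row in the output, the table also shows at a glance how much of each variable is NA before any recoding is applied.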
What are the big differences between the two approaches and the impact on the final sample size and response quality?
| | Initial Cleaning | Further Cleaning |
|---|---|---|
| Level | Row level | Variable level |
| Focus | Completeness | Quality & consistency |
| Actions | Remove invalid/incomplete responses | Recode and relabel variable values |
| Impact on Sample Size | Largest reduction | Minimal row removal |
| Impact on Quality | Removes structurally invalid data | Improves analytical validity |
case_when() to ensure consistency
Sino Yoga