M06 Reflection on Labelled Data

Advanced Data Wrangling

Juan Christian De La Cruz - Calderon

IBM 6530, Cal Poly Pomona

2026-02-23

Prompt 1

Question 1:

What is labelled data? Does your MSDM CEP involve labelled data? Explain why or why not.

What techniques are relevant for your MSDM CEP? Please elaborate on how these approaches might help you address your AO for CEP.

My Response:

1A)
In the Step 1 video, the instructor explains that labelled data refers to variables that store both a numeric code and a descriptive label. This structure is common in survey datasets, especially those created in SPSS, because it preserves the numeric values needed for statistical analysis while also retaining the human‑readable categories that make interpretation clearer. For example, a Likert‑scale item may record a response as 4 but also carry the label “Agree.”
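To make the code-plus-label idea concrete, here is a minimal sketch in plain Python. It is only a conceptual illustration: the course works in R, and this does not reproduce how SPSS or the haven package store labels. The five-point scale, the label wording other than "Agree," and the response values are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical value labels for a 5-point Likert item (code -> label).
AGREEMENT_LABELS = {
    1: "Strongly disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly agree",
}

# Made-up responses, stored as numeric codes as they would be for analysis.
responses = [4, 5, 3, 4, 2, 4]

# The numeric codes support arithmetic, such as a mean score.
mean_score = sum(responses) / len(responses)

# The labels support readable frequency tables for reporting.
label_counts = Counter(AGREEMENT_LABELS[code] for code in responses)

print(round(mean_score, 2))
print(label_counts.most_common())
```

The same column serves both purposes: the codes feed the statistics, and the labels feed the tables and charts shown to stakeholders.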

My MSDM CEP does involve labelled data because my project relies on structured survey responses that measure constructs such as attitudes, satisfaction, and behavioral intentions. These variables are coded numerically but include descriptive labels that clarify what each response category represents. Keeping these labels intact is important for both exploratory data analysis and for communicating results in a way that is meaningful to stakeholders.

1B)

Several techniques demonstrated in Step 1 directly support the analytical needs of my CEP and help me address my AO:

  • Reading Labelled SPSS Data into R

    • Using the haven package ensures that variable labels and value labels are preserved. This is essential for interpreting survey constructs accurately during analysis.
  • Exploratory Data Analysis

    • Inspecting distributions, checking variable types, and identifying missing or inconsistent values helps me understand the structure and quality of my dataset before running any models. This step ensures that the variables tied to my AO behave as expected.
  • Working with Labelled Variables

    • Because my CEP uses survey‑based measures, retaining labels improves interpretability when summarizing results or presenting findings.
  • Hypothesis Testing

    • Group comparisons and significance tests help evaluate whether differences in attitudes or behaviors are meaningful. These techniques support my AO by allowing me to test relationships between key variables and validate assumptions about user behavior.
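As a rough illustration of the group-comparison idea in the last bullet, the sketch below computes Welch's t statistic for two independent groups using only the Python standard library. The two groups and their scores are made up for the example and do not come from my CEP data; the function name `welch_t` is also hypothetical. Welch's test is one common choice for comparing two group means, not necessarily the test used in the Step 1 video.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance uses the sample (n - 1) denominator.
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


# Made-up satisfaction scores (1-5) for two hypothetical respondent groups.
group_a = [4, 5, 4, 3, 5, 4]
group_b = [3, 2, 4, 3, 3, 2]

t_stat = welch_t(group_a, group_b)
print(round(t_stat, 2))
```

The statistic alone is not a significance test. Turning it into a p-value still requires the t distribution and the appropriate degrees of freedom, which a statistics package would normally handle.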

Prompt 2

Question 2:

In Step 2, you saw the initial data cleaning process. Describe it and what lessons you got from it.

My Response:

The initial cleaning stage focused on getting a clear picture of the dataset before making any major decisions. This involved confirming that the file imported correctly, reviewing how each variable was classified, scanning for missing values, and running basic summaries to understand the overall structure. What stood out to me during this step was how often datasets contain subtle issues that only become visible once you start exploring them. For example, labelled variables may be stored as numbers but actually represent categories, and missing values may be stored as custom codes rather than true NA values. The main lesson from this stage is that early exploratory review is essential; it sets the foundation for every decision that follows and prevents misinterpretation later in the analysis.
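The point about custom missing-value codes can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than the procedure from the Step 2 video: the codes 98 and 99, the helper names, and the scores are all hypothetical.

```python
# Hypothetical codes a survey might use for "refused" or "not applicable".
MISSING_CODES = {98, 99}


def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace custom missing-value codes with None."""
    return [None if value in missing_codes else value for value in values]


def mean_ignoring_missing(values):
    """Average only the observed (non-None) values."""
    observed = [value for value in values if value is not None]
    return sum(observed) / len(observed)


raw_scores = [4, 5, 99, 3, 98, 4]  # made-up responses on a 1-5 scale

# Treating the codes as real scores inflates the average far beyond the scale.
naive_mean = sum(raw_scores) / len(raw_scores)

clean_scores = recode_missing(raw_scores)
clean_mean = mean_ignoring_missing(clean_scores)

print(naive_mean, clean_mean)
```

If the codes are left in place, the summary statistic falls outside the valid range of the scale, which is exactly the kind of subtle issue that an early exploratory review is meant to catch.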

Prompt 3

Question 3:

In Step 3, you saw the further data cleaning process. Describe it and what the key takeaways are.

My Response:

The next phase of cleaning took a more evaluative approach, shifting from structural preparation to assessing the quality and credibility of the responses. This stage required more deliberate choices, such as applying stricter filters, recoding variables for consistency, and determining which cases were appropriate to keep based on the study’s goals. Rather than simply preparing the dataset, this step focused on strengthening its analytical integrity. One of the key takeaways is that data cleaning is an interpretive process: the decisions made here directly influence the conclusions that can be drawn. Although removing inconsistent or low‑quality responses may reduce the sample size, it ultimately leads to more trustworthy and defensible results.
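A minimal sketch of what stricter filtering and recoding could look like is shown below, in plain Python. The specific rules are assumptions chosen for illustration, not the criteria used in the Step 3 video: a 60-second minimum completion time, dropping respondents who gave the same answer to every item, and reverse-coding one negatively worded item on a 1-5 scale. All names and data are hypothetical.

```python
REVERSE_KEYED = {"q3"}  # hypothetical negatively worded item
SCALE_MAX = 5           # assumes a 1-5 response scale
MIN_SECONDS = 60        # hypothetical threshold for rushed responses


def reverse_code(row):
    """Flip reverse-keyed items so higher always means more favorable."""
    return {
        item: (SCALE_MAX + 1 - value) if item in REVERSE_KEYED else value
        for item, value in row["answers"].items()
    }


def is_credible(row):
    """Keep rows that were neither rushed nor answered identically throughout."""
    answers = list(row["answers"].values())
    straight_lined = len(set(answers)) == 1
    return row["seconds"] >= MIN_SECONDS and not straight_lined


# Made-up survey rows: completion time in seconds plus three item responses.
survey_rows = [
    {"id": 1, "seconds": 240, "answers": {"q1": 4, "q2": 5, "q3": 2}},
    {"id": 2, "seconds": 35, "answers": {"q1": 3, "q2": 4, "q3": 2}},   # rushed
    {"id": 3, "seconds": 180, "answers": {"q1": 3, "q2": 3, "q3": 3}},  # identical answers
    {"id": 4, "seconds": 300, "answers": {"q1": 2, "q2": 3, "q3": 4}},
]

# Filter on the raw answers first, then recode the rows that remain.
kept = [
    {"id": row["id"], "answers": reverse_code(row)}
    for row in survey_rows
    if is_credible(row)
]

print(len(survey_rows), len(kept))
```

Each threshold here is a judgment call. Changing `MIN_SECONDS` or the straight-lining rule changes which cases survive, which is why this stage is interpretive rather than mechanical.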

Prompt 4

Question 4:

What are the big differences between the two approaches and the impact on the final sample size and response quality? 

My Response:

The two cleaning stages differ in both purpose and impact. The initial cleaning phase ensures the dataset is structurally sound and ready for analysis, focusing on accessibility and basic organization. The later cleaning phase is more rigorous, applying tighter criteria to refine the dataset and improve response quality. As a result, the first stage tends to preserve a larger sample, while the second stage intentionally narrows it to enhance validity. These differences have meaningful consequences: a larger sample increases statistical power, but if it includes noise or invalid responses, the results may be misleading. A smaller, more carefully curated sample improves internal validity and strengthens confidence in the findings, at the cost of some statistical power. Effective analysis requires balancing these priorities rather than assuming that more data automatically leads to better insights.