M06-Reflection Essay-Adv Data-Wrangling

IBM 6530 | Data Cleansing Presentation

Ashley Lee

IBM 6400, Cal Poly Pomona

2026-02-23

Table of Contents

Step 1: Read SPSS Data Into RStudio

  • Labelled Data

  • Relevant Techniques for MSDM CEP

Step 2: Initial Data Cleaning Process

Step 3: Further Data Cleaning Process

Reflection of Two Approaches

Step 1: Read SPSS Data Into RStudio

What is labelled data?

Labelled data is a feature seen most often in SPSS, where variables are given labels that describe what they hold.

  • For example, when we store values in SPSS, each value can be given a number that is accompanied by a text description. In a ranking of 1-5, the labels can run from “Strongly Disagree” to “Strongly Agree” or vice versa.

  • In an example like this, it’s important to define which number equates to which ranking, as these definitions can differ for each reader. However, when SPSS data is read into R, the labels do not always translate well, and the variable may need to be converted to a factor.

  • Doing so keeps the SPSS metadata available inside R.
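To make the idea of value labels concrete, here is a minimal sketch in plain Python rather than R. It only illustrates the mapping from stored numbers to their text labels, which is roughly what converting a labelled variable to a factor achieves (in R, the haven package's as_factor() is one function that does this). The scale wording, the responses, and the helper name to_labels are all assumptions made up for illustration.

```python
# Hypothetical 1-5 agreement scale; the codes and wording are assumptions.
VALUE_LABELS = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}


def to_labels(codes, value_labels):
    """Swap each numeric code for its text label; unknown or missing codes become None."""
    return [value_labels.get(code) for code in codes]


responses = [5, 3, 1, None, 4]  # made-up survey answers, None = no response
labelled = to_labels(responses, VALUE_LABELS)
print(labelled)
```

Keeping the mapping in one place means every reader sees the same definition of what a 1 or a 5 stands for.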

Does your MSDM CEP involve labelled data?

Currently, my CEP may not involve labelled data. If we were a survey-based company, or were looking for more data attributes from external sources other than GA4, labelled data would likely be needed. Similarly, if we were to ask website or email users to rate their satisfaction, that data would need to be labelled.

What techniques are relevant for your MSDM CEP? Please elaborate on how these approaches might help you address your AO for CEP.

Two techniques I reviewed in this lecture that I found helpful include:

  • RStudio Interface: Toggling between visual and source mode in RStudio was very helpful. I often use this feature to check what my work will look like once rendered, or to clean up code errors when something does not render the way I expected.

    Important

    I will definitely use this feature when preparing my AO CEP presentation as a gut check before publishing. 

  • GitHub for Version Control: I don’t use GitHub often, but I do enjoy the features that come with collaboration and publication. One handy note I learned is to use it for version control, since code changes over time and some sections might work while others might not. Version control allows a quick check for differences between the cloud and the local drive, which avoids duplicated effort.

    Important

    In the context of the CEP, I can see myself using this to store pulled/parsed data, since the challenge with our TRI client is getting accurate data from all sources of information.

Step 2: Initial Data Cleaning Process

The data preparation lecture offered great insight into how Dr. Jung works through a large set of data points and the steps he takes to make it much more readable. In this lecture, marketing research data collected Nov-Dec ’23 was imported into R from SPSS. Some takeaways include:

  • Clean by criteria: In this example, Dr. Jung cleaned by consent and customer patronage, to name a few. Respondents who answered against the consent/prerequisite questions were removed from the survey results, as were bots.

    Note

    Some examples included attention check and comprehension check questions, which seemed fitting to use.

  • Once you have the final data, the next step is to move on to script data preparation. This is helpful when combining several sources of data and aligning them on identical columns. While some level of uniformity needs to be in place beforehand, this is a great way to make sure all the data you’re looking at is in one place.

  • A further step is to look at what the data didn’t show, including any non-responses, along with a quick but full breakdown of table comparisons based on labelled data and categorized columns.
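The clean-by-criteria idea can be sketched as follows. This is a minimal illustration in plain Python, not Dr. Jung's actual R script: the respondents, the field names (consent, patron, attention_ok), and the helper clean_by_criteria are hypothetical. It applies one criterion at a time and records how many respondents remain after each step.

```python
# Made-up respondents; field names are hypothetical, not the lecture's variables.
respondents = [
    {"id": 1, "consent": True, "patron": True, "attention_ok": True},
    {"id": 2, "consent": False, "patron": True, "attention_ok": True},
    {"id": 3, "consent": True, "patron": False, "attention_ok": True},
    {"id": 4, "consent": True, "patron": True, "attention_ok": False},
    {"id": 5, "consent": True, "patron": True, "attention_ok": True},
]

CRITERIA = ["consent", "patron", "attention_ok"]


def clean_by_criteria(rows, criteria):
    """Apply each criterion in order and record how many rows survive it."""
    attrition = [("start", len(rows))]
    for field in criteria:
        rows = [row for row in rows if row[field]]
        attrition.append((field, len(rows)))
    return rows, attrition


cleaned, attrition = clean_by_criteria(respondents, CRITERIA)
for step, remaining in attrition:
    print(f"{step}: {remaining} respondents remain")
```

Tracking the count after each criterion makes it easy to see which rule removed the most responses.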

Step 3: Further Data Cleaning Process

This section dived further into the data cleaning process for pandemic responses in hospitality establishments. Some areas of learning included:

  • Deeper insight into the metrics and responses for each survey question, with specific code chunks producing efficient, summarized data points. This felt especially helpful when thinking about parsing data back and forth without having to scroll up and down to find the data in different areas. This lecture showed a clear scroll of what each result meant for each category.

  • Data frame & labels: I thought this section was especially helpful when thinking about how to show the survey results from the Step 1 video. In Step 1, we are briefly shown how labels can be integrated into R, but this video was a nice full-circle moment in which we can see which labels were associated with each value, so that it acts as a nice reference gallery.

  • Prompting: It was very interesting to see how you can dig deeper into the data if you know what questions you’re looking for. For example, when looking at time spent washing hands or vaccine notices, we can zoom into the data rather than rely on a summary of it, in order to analyze what mutations have happened, if any, and what our threshold of comfort is.

    • Because these can be seen and interpreted in different ways by different teams, it is a great way to map out exactly what you have asked R to dig into further.
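The difference between a summary and a zoomed-in view can be sketched with a small example. This is an illustration in plain Python under stated assumptions: the answer codes, the label wording, and the responses are all made up and are not the survey's real question or data.

```python
from collections import Counter
from statistics import mean

# Hypothetical codes and labels for a hand-washing question; not the survey's real wording.
LABELS = {1: "Never", 2: "Sometimes", 3: "Often", 4: "Always"}
answers = [4, 4, 3, 2, 4, 3, 1, 4]  # made-up responses

# Summary view: one number for the whole question.
average = mean(answers)

# Zoomed-in view: how many respondents chose each labelled option.
counts = Counter(answers)
breakdown = {LABELS[code]: counts.get(code, 0) for code in sorted(LABELS)}

print(f"mean code: {average}")
for label, n in breakdown.items():
    print(f"{label}: {n}")
```

The single average hides how the answers are spread, while the labelled breakdown shows which options respondents actually chose.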

Conclusion

Overall, the initial data cleansing was a good exercise in seeing what you can filter out when you understand exactly what you are looking into.

  • For example, if your findings only apply to an audience 18+, it would not make sense to include responses from ages 17 & younger. Initial data cleaning is great as a starting point in terms of tuning out what you might not need, but further cleansing is required to make sure the remaining data is accurate.

This additional layer of complexity, while tedious, can be rewarding in contexts where you are required to analyze each question asked in the survey, or asked to provide a summary of aggregated responses. Once cleansing methods are applied, the final sample size and response quality can differ considerably from the data as first received.