Proposal Presentation
IBM 6400, Cal Poly Pomona
Labeled data means your dataset includes both input data and some form of classification, category, or target variable that tells you what each data point represents.
Labeled data might include spam/not-spam tags on emails, topic categories assigned to documents, or a numeric target such as a price. Labeled data helps with supervised learning tasks like classification and regression, since a model can be trained and evaluated against the known labels.
In the context of our project, we do not have labeled data. While we have data available from the surveys given to participants, the data consists of responses to specific questions rather than labels that categorize each record. For example, one of our questions asks, "What's your overall attitude toward this video?" This question requires a response on a Likert scale ranging from 1 to 7. In comparison, labeled data might look different in another context: if a survey measured how many people clicked on an ad, a column could be labeled "clicked ad," and each row would denote whether that participant clicked the ad, providing an easy way to determine the total.
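To make the distinction concrete, here is a minimal R sketch contrasting the two situations. All column names here are hypothetical illustrations, not columns from the actual survey:

```r
library(dplyr)

# Survey-style data: Likert responses (1-7) with no target label attached
survey <- data.frame(
  participant_id = 1:4,
  attitude_toward_video = c(5, 7, 2, 6)  # "overall attitude" item, 1-7 scale
)

# Labeled data: each row carries a "clicked_ad" label (the target variable)
ads <- data.frame(
  participant_id = 1:4,
  clicked_ad = c(TRUE, FALSE, FALSE, TRUE)
)

# With a label column, tallying the outcome is straightforward
ads %>% count(clicked_ad)
```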
After watching the first video, it appeared to cover best practices for assessing the data rather than methods for cleaning it. Through different filters, we can narrow down eligible participants from the survey so that ineligible responses do not muddy or skew the data. This initial cleaning process ensures a solid foundation for the sample that will be used in further analysis and wrangling of the data.
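A minimal sketch of what this screening step might look like in R, assuming hypothetical eligibility columns such as completed and attention_check (the real survey's criteria may differ):

```r
library(dplyr)

# Toy stand-in for the raw survey export
raw_survey <- data.frame(
  id = 1:5,
  completed = c(1, 1, 0, 1, 1),
  attention_check = c("pass", "fail", "pass", "pass", "pass")
)

# Keep only complete responses from participants who passed the attention check
eligible <- raw_survey %>%
  filter(completed == 1, attention_check == "pass")

nrow(eligible)  # 3 participants remain in this toy example
```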
After filtering and screening out the bad responses, the data must be cleaned further to correct errors such as reverse-keyed items. Using the variables created in the previous step, additional filtering narrows down the eligible participants further with functions like count(), mutate(), and group_by(). For example, these give us insight into how many patrons visited certain venues such as the spa, hair salon, or coffee shop. In some instances, recoding needs to be applied because some questions omit certain response values, causing results that are inconsistent with other responses. An example is a question that requires a response between 1 and 9 but where the 2 and 3 options are absent. Through these cleaning operations, we end up with a pool of 736 eligible participants for the study.
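As a sketch of these operations (the venue names come from the text above, but the item names, the reverse-scoring formula, and the recoding map are assumptions for illustration):

```r
library(dplyr)

responses <- data.frame(
  id = 1:4,
  venue = c("spa", "coffee shop", "spa", "hair salon"),
  item_reversed = c(1, 7, 4, 6),  # reverse-keyed item on a 1-7 scale
  item_gappy = c(1, 4, 9, 5)      # 1-9 item where options 2 and 3 were never offered
)

cleaned <- responses %>%
  mutate(
    # Flip a reverse-keyed 1-7 item: 1 becomes 7, 7 becomes 1, and so on
    item_fixed = 8 - item_reversed,
    # Collapse the gapped 1-9 scale onto consecutive 1-7 values
    # (hypothetical mapping; the real recoding depends on the instrument)
    item_gappy = recode(item_gappy, `4` = 2, `5` = 3, `6` = 4,
                        `7` = 5, `8` = 6, `9` = 7)
  )

# How many participants visited each venue
cleaned %>% group_by(venue) %>% count()
```

Here mutate() creates the corrected variables, recode() handles the omitted scale values, and group_by() with count() summarizes venue visits, matching the functions named above.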
With the initial cleaning, variables are created to manage the participants and begin filtering them. While some screening happens at this stage, it is not a comprehensive cleaning of the data; it helps us weed out ineligible participants, but the next stage of cleaning prepares the data further.
By doing a deeper dive into the data, we can use these variables to narrow down the eligible pool even further. By recoding and using specific functions for grouping and filtering, we can also correct and include data that might otherwise have been omitted because it lacked certain criteria. This step would have been overlooked in the initial cleaning, but at this stage it gets accounted for. The end result of this deeper cleaning is a pool of eligible participants that much more closely matches the criteria our study needs. The initial cleaning alone would have provided a generally usable pool, but it would not have been as comprehensive.
AI Impatience Effect