Nia Pillow Data 101 Outline: Final Project May 2022 Professor M.

  1. Data

What data set are you planning to use? Source, URL, size, description

I plan on using the Sephora Website Dataset from Kaggle. The original owner is …. She/he gathered the data by scraping the Sephora website for variables such as size, love, rating, number of reviews, ingredients, brand, name, and category. The dataset contains 9,800 observations and 21 variables.

Does it contain both quantitative and categorical values?

Yes, my data contains both quantitative and categorical values. The quantitative variables are price, value price, rating, love, and number of reviews. The categorical variables are brand, name, category, and ingredients.

Does it have any personal importance for you?

I wouldn’t say this project has any personal importance to me beyond simply loving to buy and apply makeup! My mother went to cosmetology school and earned money doing hair while I was growing up, and it was the only job that let the two of us spend a lot of time together. I was surprised to learn that Sephora Beauty carries so many more brands online than in its physical stores, so I had a lot of data to work with.

  2. Questions

What questions are you considering?

1. I want to see if there is a relationship between price point and product size.
2. I would also like to see if there is a relationship between price point and brand rating.
3. I want to see which facial product had the highest rating and look at its cost.
4. I want to see which product was the most popular, using a frequency table.
5. I want to filter the data to see which brand had the best-rated face serum, face wash, and moisturizer.

Are there any potential challenges regarding these questions? Could you use more than one data set to answer your questions?

I always start a project with initial questions, and they evolve into other questions as I work through it. Some of the answers to my questions just aren’t available within the data. To create an additional dataset, I could scrape the Sephora website myself for updated data.

  3. EDA

How will you conduct EDA with dplyr? Think about the ways you need to organize your data by subsetting it, ranking it, or adding columns.

These are my two favorite and most-used dplyr functions:

• select(): I am always selecting different variables throughout my projects to look at different relationships.
• filter(): I use this function the most, to look at rankings, detailed observations, price points, etc.
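As a minimal sketch of how select() and filter() could be applied to this kind of data: the data frame below is a small made-up stand-in, and its column names and values are assumptions, not the real Kaggle columns. arrange() is added here only to show a ranking and is not one of the functions named above.

```r
library(dplyr)

# Toy stand-in for the Sephora data; column names and values are hypothetical.
sephora <- data.frame(
  brand    = c("Brand A", "Brand B", "Brand A", "Brand C"),
  category = c("Face Serums", "Moisturizers", "Face Wash", "Face Serums"),
  price    = c(68, 42, 25, 90),
  rating   = c(4.5, 4.0, 4.2, 3.5),
  stringsAsFactors = FALSE
)

# select(): keep only the columns needed for one question.
price_rating <- sephora %>%
  select(brand, price, rating)

# filter(): keep only the rows that match a condition,
# then arrange() to rank them from highest to lowest rating.
top_serums <- sephora %>%
  filter(category == "Face Serums") %>%
  arrange(desc(rating))
```

The same pattern could be repeated with a different category value to compare face washes or moisturizers.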

What basic charts do you want to create? Will these be base R, ggplot, or a combination?

I will use a combination of base R and ggplot. I tried ggplot charts first, but I decided to create my charts with the base R plot() and barplot() functions.

Why do the chart types you have chosen make sense in answering your questions?
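A rough sketch of the base R charts mentioned above, again on made-up data with assumed column names: a scatter plot suits a question about the relationship between two quantitative variables, and a bar chart suits a frequency table of a categorical variable.

```r
# Toy stand-in for the Sephora data; column names and values are hypothetical.
sephora <- data.frame(
  brand  = c("Brand A", "Brand B", "Brand A", "Brand C", "Brand A"),
  price  = c(68, 42, 25, 90, 30),
  rating = c(4.5, 4.0, 4.2, 3.5, 4.8)
)

# Frequency table: how many products each brand has in the data.
brand_counts <- table(sephora$brand)

# Scatter plot for a relationship between two quantitative variables.
plot(sephora$price, sephora$rating,
     xlab = "Price", ylab = "Rating", main = "Rating vs. price")

# Bar chart of the frequency table.
barplot(brand_counts,
        xlab = "Brand", ylab = "Number of products",
        main = "Products per brand")
```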

  4. Statistical Analysis

What variables will you use to conduct the two-sample t-test? Would a paired t-test make sense? Set up the null hypothesis and check assumptions.

I will use the price and value price variables from the Sephora dataset to conduct my two-sample t-test.

What variables will you use for the Chi-Squared test? Set up the null hypothesis and check assumptions.

For my Chi-Squared test, I will use brand as my main variable and measure it against other categorical variables in my dataset.

H0: The variables are not associated, i.e., are independent. (Null Hypothesis)
H1: The variables are associated, i.e., are dependent. (Alternative Hypothesis)
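A minimal sketch of both tests in base R, using simulated data rather than the real Sephora values. The column names (price, value_price, brand, category) and the choice of category as the second categorical variable are assumptions made for illustration.

```r
set.seed(101)

# Simulated stand-in for the Sephora data; names and values are hypothetical.
n <- 60
price <- round(runif(n, min = 10, max = 100))
sephora <- data.frame(
  brand       = rep(c("Brand A", "Brand B"), each = 30),
  category    = rep(c("Serum", "Moisturizer", "Cleanser"), times = 20),
  price       = price,
  value_price = price + sample(c(0, 0, 5, 10), n, replace = TRUE)
)

# Two-sample t-test (Welch's version is the default in t.test()).
# H0: the mean price equals the mean value price.
# Adding paired = TRUE would run the paired version instead.
t_result <- t.test(sephora$price, sephora$value_price)

# Chi-squared test of independence on a brand-by-category table.
# H0: brand and category are independent.
tab <- table(sephora$brand, sephora$category)
chi_result <- chisq.test(tab)

# Assumption check: expected cell counts should generally be at least 5.
chi_result$expected

# p-values to compare against the chosen significance level (e.g. 0.05).
t_result$p.value
chi_result$p.value
```

On the real data, brands with very few products may produce small expected counts, so it could help to filter to a handful of brands before building the table.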

  5. Summaries

• Introduction
• Zero Missing Values
• Variance, Two-Sample T-test
• Chi-Squared test
• Dr. Jart+: Winner – Price point, reviews, marketing, etc.
• Chart: Chi-Squared
• Chart: Value Price vs Price: No significance