R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

#Warm-up activity for fun - can you replace my name with yours using the following functions?

paste("Today is", date())

## [1] "Today is Fri Sep 19 04:20:32 2025"

name <- "Clarissa"
state <- "California"
print(name) #this syntax is more intuitive

## [1] "Clarissa"

paste(name, "lives in", state)

## [1] "Clarissa lives in California"

Research design issues

In this study, we analyze the effectiveness of ads on sales using Welch’s independent-samples t-test. Here, independent means that the observations in one group (i.e., customers in this case) are not matched with observations in the other group. Alternatively, we could perform a paired-samples t-test to check whether the sales of each store differ between a before condition (consumers exposed to natural themes) and an after condition (consumers exposed to family themes).
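To make the distinction between the two designs concrete, here is a minimal sketch in R. The data are simulated for illustration only (the names `nature_sales`, `family_sales`, `before`, and `after` are hypothetical and are not taken from grapeJuice.csv). It shows that the same `t.test()` function covers both designs, depending on the `paired` argument.

```r
set.seed(123)
n <- 10  # hypothetical number of stores per condition

# Independent design: two different sets of stores, one per ad theme
nature_sales <- rnorm(n, mean = 190, sd = 35)
family_sales <- rnorm(n, mean = 245, sd = 50)
welch <- t.test(nature_sales, family_sales, var.equal = FALSE)

# Paired design: the same stores measured under both themes
before <- rnorm(n, mean = 190, sd = 35)
after  <- before + rnorm(n, mean = 20, sd = 15)
paired <- t.test(before, after, paired = TRUE)

# Compare which test was run and the degrees of freedom each one uses
welch$method
paired$method
c(welch_df = unname(welch$parameter), paired_df = unname(paired$parameter))
```

In the paired design the test works on the within-store differences, so its degrees of freedom are n - 1, whereas the Welch test estimates its degrees of freedom from the two group variances.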

Variable Description

Sales: Total unit sales of the grape juice in one week in a store
price: Average unit price of the grape juice in the week
ad_type: The in-store advertisement type used to promote the grape juice
ad_type = 0: the theme of the ad is natural production of the juice
ad_type = 1: the theme of the ad is family health caring
price_apple: Average unit price of the apple juice in the same store in the week
price_cookies: Average unit price of the cookies in the same store in the week

Step 1:

Please write a null hypothesis and an alternative hypothesis using the template hypotheses available in the research design module.

Null hypothesis: The mean sales of grape juice are the same for natural ads and family-health ads.

Alternative hypothesis: The mean sales of grape juice differ between natural ads and family-health ads.

Step 2:

Please make your conclusions based on the results in descriptive analysis 3. What is your conclusion?

Based on Descriptive Analysis 3, the mean sales for family-health ads (246.67) are higher than for natural ads (186.67). This suggests that family-health ads may be more effective.

We performed a normality test in Step - normality check 1. What is your conclusion?

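From the Shapiro-Wilk normality tests, both groups have p-values greater than 0.05 (0.4155 for natural ads and 0.08695 for family-health ads), so we fail to reject the null hypothesis of normality. The sales data are consistent with a normal distribution, and the normality assumption of the t-test is reasonable.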

Hint: read the third reference article.

data <- read.csv("grapeJuice.csv")
summary(data)

##        X             Sales           price           ad_type     price_apple   
##  Min.   : 1.00   Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300  
##  1st Qu.: 8.25   1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438  
##  Median :15.50   Median :204.5   Median : 9.855   Median :0.5   Median :7.580  
##  Mean   :15.50   Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659  
##  3rd Qu.:22.75   3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805  
##  Max.   :30.00   Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290  
##  price_cookies   
##  Min.   : 8.790  
##  1st Qu.: 9.190  
##  Median : 9.515  
##  Mean   : 9.622  
##  3rd Qu.:10.140  
##  Max.   :10.580

# Compare sales by ad type
tapply(data$Sales, data$ad_type, mean)

##        0        1 
## 186.6667 246.6667

tapply(data$Sales, data$ad_type, sd)

##        0        1 
## 35.86416 50.50413

Step 3: Perform a t-test using Excel or R

t.test(Sales ~ ad_type, data = data, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  Sales by ad_type
## t = -3.7515, df = 25.257, p-value = 0.0009233
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -92.92234 -27.07766
## sample estimates:
## mean in group 0 mean in group 1 
##        186.6667        246.6667

The p-value (0.0009233) is less than 0.05, so we reject the null hypothesis. There is a statistically significant difference in mean sales between the two ad types.
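As a check on how the Welch statistic is built, the sketch below recomputes it by hand from the group means and standard deviations printed earlier. One assumption is needed: 15 observations per group, which is inferred from the 30 rows and the ad_type mean of 0.5 rather than stated directly in the output. All variable names are chosen for illustration.

```r
# Summary statistics copied from the tapply() output above
mean_0 <- 186.6667; sd_0 <- 35.86416   # ad_type = 0 (natural theme)
mean_1 <- 246.6667; sd_1 <- 50.50413   # ad_type = 1 (family-health theme)

# Assumption: 15 stores per group (30 rows, mean of ad_type = 0.5)
n_0 <- 15; n_1 <- 15

# Squared standard error contributed by each group
v_0 <- sd_0^2 / n_0
v_1 <- sd_1^2 / n_1
se  <- sqrt(v_0 + v_1)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- (mean_0 - mean_1) / se
df_welch <- (v_0 + v_1)^2 / (v_0^2 / (n_0 - 1) + v_1^2 / (n_1 - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci      <- (mean_0 - mean_1) + c(-1, 1) * qt(0.975, df_welch) * se

round(c(t = t_stat, df = df_welch), 4)
p_value
round(ci, 2)
```

Up to rounding of the inputs, these values should agree with the t, df, p-value, and confidence interval reported by t.test() above.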

Step 3.1 Perform a t-test using Excel

In this step, you will perform a t-test using Excel. Once you get the result, please attach your output spreadsheet in the discussion forum.

Reference: Excel - Independent samples Welch t test (via data analysis) https://www.youtube.com/watch?v=sHqCrK_FMyY

Step 3.2 (Preparation & debugging)

In this step, you will perform the t-test again using R and RStudio. The goal is to help you document your analysis for future reference.

Please try to perform the analysis using R and RPubs before the class on Thursday and post your final bugs and errors (or the final URL of your RPubs page) to receive participation credits toward your final Engagement grade.

Note: For details about “Preparation & debugging,” please read the section “Preparation & debugging” in the Syllabus or the Syllabus page of LMS.

The “Preparation & debugging” process can sometimes be frustrating for statistics majors. Do not panic! I hope you can see the challenge as an opportunity to build a stronger sense of self. You may find the following testimony by Thomas Mock helpful. Please also try to watch the YouTube video “R Programming Tutorial - Learn the Basics of Statistical Computing” to get familiar with the R basics.

“Within the first month of the course I actually reverted back to doing things in Systat with a GUI as I was so frustrated with not knowing what I was doing in R.”

References:

My R Journey: Thomas Mock https://rfortherestofus.com/2019/09/my-r-journey-thomas-mock/

R Programming Tutorial - Learn the Basics of Statistical Computing: https://www.youtube.com/watch?v=_V8eKsto3Ug

Step 4: Wrap-up - interpret the results

We performed a Welch’s t-test in Step 3. What is your conclusion?

The descriptive analysis showed that family-health ads have higher mean sales (246.67 vs. 186.67). The Shapiro-Wilk tests did not reject normality for either group, so the normality assumption is reasonable. Welch’s t-test confirmed that the difference is statistically significant (p = 0.0009233). The family-health ad is more effective in increasing grape juice sales and should be chosen for rollout.

Hint: read the first three reference articles. Make sure to cite.

In Class Participation (Histogram for Different Variable)

Histogram for price_apple

hist(data$price_apple, main="Histogram: price_apple", xlab="Price Apple", prob=TRUE)
lines(density(data$price_apple), lty="dashed", lwd=2.5, col="red")

Descriptive Analysis 1:

data <- read.csv('grapeJuice.csv') #read data
str(data)

## 'data.frame':    30 obs. of  6 variables:
##  $ X            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sales        : int  222 201 247 169 317 227 214 187 188 275 ...
##  $ price        : num  9.83 9.72 10.15 10.04 8.38 ...
##  $ ad_type      : int  0 1 1 0 1 0 1 0 1 0 ...
##  $ price_apple  : num  7.36 7.43 7.66 7.57 7.33 7.51 7.57 7.66 7.39 8.29 ...
##  $ price_cookies: num  8.8 9.62 8.9 10.26 9.54 ...

head(data)  #view the first 6 lines

##   X Sales price ad_type price_apple price_cookies
## 1 1   222  9.83       0        7.36          8.80
## 2 2   201  9.72       1        7.43          9.62
## 3 3   247 10.15       1        7.66          8.90
## 4 4   169 10.04       0        7.57         10.26
## 5 5   317  8.38       1        7.33          9.54
## 6 6   227  9.74       0        7.51          9.49

tail(data) #view the last 6 lines

##     X Sales price ad_type price_apple price_cookies
## 25 25   335  8.34       1        8.23          9.13
## 26 26   145 10.27       0        7.41         10.58
## 27 27   201 10.26       1        7.67          9.22
## 28 28   131 10.49       0        7.59         10.43
## 29 29   210 10.36       0        7.93          9.44
## 30 30   279  8.56       1        7.65         10.44

#perform some basic descriptive analysis 
summary(data)

##        X             Sales           price           ad_type     price_apple   
##  Min.   : 1.00   Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300  
##  1st Qu.: 8.25   1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438  
##  Median :15.50   Median :204.5   Median : 9.855   Median :0.5   Median :7.580  
##  Mean   :15.50   Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659  
##  3rd Qu.:22.75   3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805  
##  Max.   :30.00   Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290  
##  price_cookies   
##  Min.   : 8.790  
##  1st Qu.: 9.190  
##  Median : 9.515  
##  Mean   : 9.622  
##  3rd Qu.:10.140  
##  Max.   :10.580

Descriptive Analysis 2:

#set the 1 by 2 layout plot window
par(mfrow=c(1,2))
#Check if there are outliers using a boxplot
#Let's perform boxplots in two different ways
boxplot(data$Sales,main="Boxplot for sales data", ylab="Sales")
boxplot(data$Sales,main="Boxplot for sales data", horizontal = TRUE, xlab="Sales")

#Let's perform a histogram analysis
hist(data$Sales,main='histogram plot for sales data',xlab='sales_grape',prob=T)
lines(density(data$Sales),lty='dashed',lwd=2.5, col='blue')

There appear to be no outliers, and the distribution of the data is roughly normal.
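Reading outliers off a boxplot can be backed up numerically. The sketch below uses `boxplot.stats()`, the function that `boxplot()` relies on to decide which points to draw beyond the whiskers. The vector `sales_example` is hypothetical and includes one deliberately extreme value. It is not the grapeJuice.csv data.

```r
# Hypothetical weekly sales with one deliberately extreme value (600)
sales_example <- c(131, 145, 169, 187, 201, 222, 247, 279, 335, 600)

# coef = 1.5 is the default rule: points more than 1.5 times the
# hinge spread beyond the lower or upper hinge are flagged
stats <- boxplot.stats(sales_example, coef = 1.5)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # flagged values; here only 600
```

Running `boxplot.stats(data$Sales)$out` on the real data would return an empty vector if the boxplot shows no points beyond the whiskers.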

Descriptive analysis 3 - Compare the mean of sales with the two different ad types

#divide the dataset into two sub dataset by ad_type
sales_ad_nature = subset(data,ad_type==0)
sales_ad_family = subset(data,ad_type==1)

#calculate the mean of sales with different ad_type
mean(sales_ad_nature$Sales)

## [1] 186.6667

mean(sales_ad_family$Sales)

## [1] 246.6667

Assumption check 1

The t-test assumes that the observations are independent and normally distributed.

#set the 1 by 2 layout plot window
par(mfrow = c(1,2))

# Explore the distribution of the data using histogram
hist(sales_ad_nature$Sales,main="",xlab="sales with nature theme ad",prob=T)
lines(density(sales_ad_nature$Sales),lty="dashed",lwd=2.5,col="red")

hist(sales_ad_family$Sales,main="",xlab="sales with family theme ad",prob=T)
lines(density(sales_ad_family$Sales),lty="dashed",lwd=2.5,col="red")

#set the 1 by 2 layout plot window
par(mfrow = c(1,2))

# boxplot to check if there are outliers in each group
boxplot(sales_ad_family$Sales,horizontal = TRUE, xlab="sales with family theme ad")
boxplot(sales_ad_nature$Sales,horizontal = TRUE, xlab="sales with nature theme ad")

Let’s build a more elegant boxplot with ggplot (the most elegant and aesthetically pleasing graphics framework available)

First we convert the variable ad_type from a numeric to a factor variable

data$ad_type <- as.factor(data$ad_type)
head(data)

##   X Sales price ad_type price_apple price_cookies
## 1 1   222  9.83       0        7.36          8.80
## 2 2   201  9.72       1        7.43          9.62
## 3 3   247 10.15       1        7.66          8.90
## 4 4   169 10.04       0        7.57         10.26
## 5 5   317  8.38       1        7.33          9.54
## 6 6   227  9.74       0        7.51          9.49

# Import the ggplot library
library(ggplot2)
# Wait for the magic to happen
ggplot(data, aes(x=ad_type, y=Sales, fill=ad_type))+
  geom_boxplot(outlier.shape = NA, alpha=.5) +
  geom_jitter(width=.1, size=1) +
  theme_classic() +
  scale_fill_manual(values=c("lightseagreen","darkseagreen"))

Assumption check 2

In this step, we perform a Shapiro-Wilk test to check whether our data come from a normally distributed population.

shapiro.test(sales_ad_nature$Sales)

## 
##  Shapiro-Wilk normality test
## 
## data:  sales_ad_nature$Sales
## W = 0.94255, p-value = 0.4155

shapiro.test(sales_ad_family$Sales)

## 
##  Shapiro-Wilk normality test
## 
## data:  sales_ad_family$Sales
## W = 0.89743, p-value = 0.08695
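The decision rule applied to these p-values can be written out explicitly. The following is a minimal sketch on simulated stand-in data (the data frame `juice` and the threshold `alpha` are assumptions for illustration, not the grapeJuice.csv results). It runs the test once per group and labels each outcome.

```r
set.seed(1)

# Simulated stand-in for the real data: 15 stores per ad type
juice <- data.frame(
  ad_type = rep(c(0, 1), each = 15),
  Sales   = c(rnorm(15, mean = 187, sd = 36),
              rnorm(15, mean = 247, sd = 51))
)

alpha <- 0.05  # conventional significance level (an assumption)

# Run the Shapiro-Wilk test within each group and keep the p-values
p_values <- sapply(split(juice$Sales, juice$ad_type),
                   function(x) shapiro.test(x)$p.value)

# A p-value above alpha means we fail to reject normality;
# it does not prove that the data are normal
decision <- ifelse(p_values > alpha,
                   "fail to reject normality",
                   "reject normality")

data.frame(ad_type = names(p_values),
           p_value = round(p_values, 4),
           decision = decision)
```

With small groups such as these, the test has limited power, so it is worth reading the result together with the histograms and boxplots from Assumption check 1.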

t-test

Performing a t-test on a variable with two categories (e.g., Control and Treated) helps us understand whether the population means of the two groups differ.

mu = 0 refers to the null hypothesis that the difference between Control and Treated is 0, i.e., that the two group means are equal. alternative = "two.sided" requests a two-sided t-test. conf.level = 0.95 sets the confidence level of the reported interval. These are the defaults of t.test(), so they do not need to be typed out.

t.test(Sales ~ ad_type, data)

## 
##  Welch Two Sample t-test
## 
## data:  Sales by ad_type
## t = -3.7515, df = 25.257, p-value = 0.0009233
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -92.92234 -27.07766
## sample estimates:
## mean in group 0 mean in group 1 
##        186.6667        246.6667
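To show that the short call above relies on default arguments, here is a sketch that compares a default call with one in which `mu`, `alternative`, `conf.level`, and `var.equal` are all written out. The data frame `juice` is simulated for illustration and is only a stand-in for grapeJuice.csv.

```r
set.seed(42)

# Simulated stand-in data: 15 stores per ad type
juice <- data.frame(
  ad_type = factor(rep(c(0, 1), each = 15)),
  Sales   = c(rnorm(15, mean = 187, sd = 36),
              rnorm(15, mean = 247, sd = 51))
)

# Short form, relying on the defaults
default_call <- t.test(Sales ~ ad_type, data = juice)

# Same test with every relevant argument spelled out
explicit_call <- t.test(Sales ~ ad_type, data = juice,
                        mu = 0,                     # null: difference in means is 0
                        alternative = "two.sided",  # two-sided test
                        conf.level = 0.95,          # 95% confidence interval
                        var.equal = FALSE)          # Welch (unequal variances)

# Both calls give the same statistic, p-value, and interval
all.equal(default_call$statistic, explicit_call$statistic)
all.equal(default_call$p.value, explicit_call$p.value)
attr(explicit_call$conf.int, "conf.level")
```

Writing the arguments out is optional, but it documents the test choices for anyone rereading the analysis later.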

library(pander)
panderOptions('round',4)
panderOptions('digits',7)
panderOptions('keep.trailing.zeros',TRUE)
panderOptions("table.split.table", Inf) 
pander(t.test(sales_ad_nature$Sales,sales_ad_family$Sales))

Welch Two Sample t-test: `sales_ad_nature$Sales` and `sales_ad_family$Sales`
Test statistic	df	P value	Alternative hypothesis	mean of x	mean of y
-3.7515	25.2571	9e-04 * * *	two.sided	186.6667	246.6667

Google & StackOverflow are definitely the top go-to choices for developers and programmers at any level

For R-related questions, use https://stackoverflow.com/questions/tagged/r

For statistics related questions, use https://stats.stackexchange.com.

Data Science Specialization at Johns Hopkins University, https://www.coursera.org/specializations/jhu-data-science

References

Stuart Frisby, Booking.com - Conversions@Google 2017. https://www.youtube.com/watch?v=_sx5LV23hIE

Design Testing at Netflix https://www.youtube.com/watch?v=-Gy8TnoXZf8

Mobile A/B Testing Results Analysis: Statistical Significance, Confidence Level and Intervals https://splitmetrics.com/blog/mobile-a-b-testing-statistical-significance/

Gemini: Wayfair’s advanced marketing test design and measurement platform https://tech.wayfair.com/data-science/2019/07/gemini-wayfairs-advanced-marketing-test-design-and-measurement-platform/

Two Independent Samples Unequal Variance (Welch’s Test) https://sites.nicholas.duke.edu/statsreview/means/welch/

ANOVA, t-tests and regression: different ways of showing the same thing http://deevybee.blogspot.com/2017/11/anova-t-tests-and-regression-different.html

The Independent Samples t-test (Welch Test) https://stats.libretexts.org/Bookshelves/Applied_Statistics/Book%3A_Learning_Statistics_with_R_-_A_tutorial_for_Psychology_Students_and_other_Beginners_(Navarro)/13%3A_Comparing_Two_Means/13.04%3A_The_Independent_Samples_t-test_(Welch_Test)

https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test