#setwd("C:/Users/zxu3/Documents/R/abtesting")
# Install the "readr" package first if it is not already installed:
#install.packages("readr")
library(readr)
data <- read_csv("abtesting.csv")
## Rows: 38 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Ads, Purchase
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(data) # list the variables (columns) in the dataset, in column order
## [1] "Ads" "Purchase"
head(data) # show the first 6 rows of the dataset
## # A tibble: 6 × 2
## Ads Purchase
## <dbl> <dbl>
## 1 1 113
## 2 0 83
## 3 0 52
## 4 1 119
## 5 1 188
## 6 0 99
# convert Ads from a numeric 0/1 column to a factor (categorical variable)
data$Ads <- factor(data$Ads)
is.factor(data$Ads)
## [1] TRUE
# show the first 15 values of the variable "Ads"
data$Ads[1:15]
## [1] 1 0 0 1 1 0 0 1 1 1 0 0 0 1 0
## Levels: 0 1
#now we do the regression analysis and examine the results
summary(lm(Purchase~Ads, data = data))
##
## Call:
## lm(formula = Purchase ~ Ads, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.000 -23.250 3.071 22.643 51.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.429 6.441 14.816 < 2e-16 ***
## Ads1 41.571 9.630 4.317 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.52 on 36 degrees of freedom
## Multiple R-squared: 0.3411, Adjusted R-squared: 0.3228
## F-statistic: 18.64 on 1 and 36 DF, p-value: 0.0001184
# Alternatively, call factor() inside the lm() formula, which saves the step of creating the factor variable first.
summary(lm(Purchase ~ factor(Ads), data = data))
##
## Call:
## lm(formula = Purchase ~ factor(Ads), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.000 -23.250 3.071 22.643 51.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.429 6.441 14.816 < 2e-16 ***
## factor(Ads)1 41.571 9.630 4.317 0.000118 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.52 on 36 degrees of freedom
## Multiple R-squared: 0.3411, Adjusted R-squared: 0.3228
## F-statistic: 18.64 on 1 and 36 DF, p-value: 0.0001184
Q1: One of the referenced articles is “Understanding R programming over Excel for Data Analysis” from Gap Intelligence. At first, I thought Excel was a sufficient tool for data analysis, as it provides built-in functionalities like pivot tables and regression analysis. But now I think that R is a much more powerful tool for handling large datasets and automating repetitive tasks. The article explains how Excel struggles with large datasets, while R can efficiently process millions of records. It also highlights the advantage of R Markdown for documenting and replicating analyses. This has changed my perspective on using R for more scalable and reproducible data analysis.
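Q2: Regression Results (R Analysis)
The regression of Purchase on Ads in R gives the following key metrics:
P-value: The p-value for Ads1 is 0.000118, well below 0.05, so we reject the null hypothesis of no difference and conclude that advertising exposure has a statistically significant impact on purchase behavior.
R-squared: R² is 0.3411 (adjusted 0.3228), so ad exposure explains about 34% of the variance in the dependent variable (Purchase). A higher R² means a better fit; here roughly two thirds of the variation is driven by factors other than Ads.
Coefficients: The intercept, 95.429, is the average Purchase for the baseline group (Ads = 0, no ad exposure). The Ads1 coefficient, 41.571, is the estimated increase in Purchase for the exposed group relative to that baseline, which puts the exposed group's average at about 137.
Marketing Implications
As a marketing analyst for the company, I read these results as evidence that advertising exposure positively impacts sales: the Ads1 coefficient is large and positive relative to the baseline. Increasing ad frequency within an optimal range could therefore enhance sales. However, Ads is a binary variable in this dataset, so the model only compares exposed and unexposed customers and cannot show whether the impact plateaus at higher frequencies (a non-linear relationship). If it does plateau, we should optimize ad spend to maximize ROI rather than overload consumers with excessive ads, and a follow-up test that varies ad frequency would be needed to find that point.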
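The metrics discussed in Q2 can also be pulled out of the fitted model programmatically rather than read off the printed summary. The sketch below is only illustrative: it uses a simulated data frame (`sim`, with made-up values) as a stand-in for abtesting.csv, so its numbers will not match the output above. It also checks two facts that hold for any regression on a single two-level factor: the Ads1 coefficient equals the difference in group means, and its p-value matches a pooled-variance two-sample t-test.

```r
# Simulated stand-in for abtesting.csv (hypothetical values, NOT the real data)
set.seed(42)
sim <- data.frame(
  Ads = factor(rep(c(0, 1), each = 19)),
  Purchase = c(rnorm(19, mean = 95, sd = 30), rnorm(19, mean = 137, sd = 30))
)

fit <- lm(Purchase ~ Ads, data = sim)
fit_summary <- summary(fit)

# Extract the three quantities discussed in Q2
ads_effect  <- coef(fit)[["Ads1"]]                          # difference vs. baseline
ads_p_value <- fit_summary$coefficients["Ads1", "Pr(>|t|)"] # p-value for that difference
r_squared   <- fit_summary$r.squared                        # share of variance explained

# With one 0/1 factor, the coefficient is the difference in group means
mean_gap <- mean(sim$Purchase[sim$Ads == "1"]) - mean(sim$Purchase[sim$Ads == "0"])

# The same comparison as a two-sample t-test with pooled variance
t_p_value <- t.test(Purchase ~ Ads, data = sim, var.equal = TRUE)$p.value

print(c(effect = ads_effect, p_value = ads_p_value, r_squared = r_squared))
print(all.equal(ads_effect, mean_gap))
print(all.equal(ads_p_value, t_p_value))
```

Working from the summary object instead of the printed table makes it easier to report exact values and to reuse them in later calculations.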
Q3: Comparison of Example 2.1 and Example 2.2
Example 2.1 explicitly converts the Ads variable into a factor before performing the regression, while Example 2.2 does this conversion directly inside the lm() function. The two outputs above are the same: the estimates, standard errors, p-values, and R² all match, and only the coefficient label differs (Ads1 versus factor(Ads)1). This is expected, since both methods treat Ads as a categorical variable. I prefer Example 2.2 because it is more concise and eliminates an extra step. It simplifies the code while achieving the same statistical results.
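The equivalence of the two approaches can be checked directly. The sketch below is a minimal illustration on simulated data (`sim` holds made-up values and is not the abtesting.csv data); the object names `fit_pre` and `fit_inline` are chosen here for illustration only.

```r
# Simulated stand-in data (hypothetical values, NOT the real data)
set.seed(1)
sim <- data.frame(
  Ads = rep(c(0, 1), times = 19),          # numeric 0/1, as read from a CSV
  Purchase = rnorm(38, mean = 100, sd = 30)
)

# Approach of Example 2.1: convert to a factor once, then fit
sim_factor <- sim
sim_factor$Ads <- factor(sim_factor$Ads)
fit_pre <- lm(Purchase ~ Ads, data = sim_factor)

# Approach of Example 2.2: convert inside the formula
fit_inline <- lm(Purchase ~ factor(Ads), data = sim)

# Only the coefficient labels differ
print(names(coef(fit_pre)))     # "(Intercept)" "Ads1"
print(names(coef(fit_inline)))  # "(Intercept)" "factor(Ads)1"

# The estimates themselves are identical once the labels are dropped
same_estimates <- all.equal(unname(coef(fit_pre)), unname(coef(fit_inline)))
print(same_estimates)
```

One practical difference: the inline form leaves the stored column numeric, so factor() has to be repeated in every later model, whereas converting once changes the column for all subsequent analyses.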