Here’s a table with field names and descriptions. Generate 10 records for this table.
User_ID: A unique value identifying each buyer
Product_ID: A unique value identifying each product
Gender: The gender of the buyer (M or F)
Age: The buyer’s age
Occupation: The buyer’s occupation, specified as a numeric value
City_Category: The city category in which the purchase was made
Marital_Status: The marital status of the buyer. 0 denotes single, and 1 denotes married.
Purchase: The amount spent by a user for one purchase in dollars
df <- data.frame(
  User_ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010),
  Product_ID = c(101, 102, 103, 104, 105, 106, 107, 108, 109, 110),
  Gender = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
  Age = c(23, 34, 40, 26, 42, 29, 50, 38, 45, 31),
  Occupation = c(2, 5, 3, 7, 4, 6, 2, 5, 3, 7),
  City_Category = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
  Marital_Status = c(0, 1, 1, 0, 1, 0, 1, 1, 1, 0),
  Purchase = c(1370.5, 2365.3, 1849.2, 1420.9, 3215.6, 1768.2, 2798.1, 1935.7, 2520.4, 2186.9)
)
In City_Category, replace A by aaa, B by bbb, and C by ccc
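A minimal base-R sketch; df_recoded is a name introduced here so that df itself is left unchanged for the later steps:
# Replace the City_Category codes A, B, C with aaa, bbb, ccc
df_recoded <- df
df_recoded$City_Category[df_recoded$City_Category == "A"] <- "aaa"
df_recoded$City_Category[df_recoded$City_Category == "B"] <- "bbb"
df_recoded$City_Category[df_recoded$City_Category == "C"] <- "ccc"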
Add 90 records
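A sketch for appending 90 simulated records; the value ranges and the names extra and df_100 are illustrative assumptions rather than part of the original data:
# Generate 90 additional records and append them to df
set.seed(42)
extra <- data.frame(
  User_ID = 1011:1100,
  Product_ID = 111:200,
  Gender = sample(c("M", "F"), 90, replace = TRUE),
  Age = sample(18:60, 90, replace = TRUE),
  Occupation = sample(1:7, 90, replace = TRUE),
  City_Category = sample(c("A", "B", "C"), 90, replace = TRUE),
  Marital_Status = sample(c(0, 1), 90, replace = TRUE),
  Purchase = round(runif(90, min = 1000, max = 3500), 1)
)
df_100 <- rbind(df, extra)  # 10 original rows + 90 new ones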
write code in R to filter Age below 40
use tidyverse to filter
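A minimal sketch: base R could use subset(df, Age < 40); with the tidyverse loaded, dplyr's filter() does the same:
library(tidyverse)
# Keep only buyers younger than 40
df %>%
  filter(Age < 40)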
create a new variable called Age2 which is Age squared
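With mutate(), assuming the tidyverse is loaded as above:
# Add Age2, the square of Age
df <- df %>%
  mutate(Age2 = Age^2)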
Compute the mean purchase by gender
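A sketch with group_by() and summarise():
# Mean purchase by gender
df %>%
  group_by(Gender) %>%
  summarise(mean_purchase = mean(Purchase))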
Which occupation has the lowest average purchase?
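One way to answer this is to summarise by occupation and keep the smallest mean:
# Occupation with the lowest average purchase
df %>%
  group_by(Occupation) %>%
  summarise(mean_purchase = mean(Purchase)) %>%
  slice_min(mean_purchase, n = 1)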
Is purchase related to city_category?
To interpret the ANOVA results, we need to look at the F-statistic and the p-value. The F-statistic tests the null hypothesis that the mean Purchase is the same across all City_Category groups against the alternative that at least one group mean is different. The p-value is the probability of obtaining an F-statistic at least as extreme as the one observed, assuming the null hypothesis is true.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that there is a significant difference in mean Purchase between at least two of the City_Category groups. In this case, we would need to perform post-hoc tests (such as Tukey’s HSD) to determine which groups are significantly different from each other.
# Performing an ANOVA test on Purchase by City_Category
anova_result <- aov(Purchase ~ City_Category, data = df)
# Printing the ANOVA table
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## City_Category 2 580802 290401 0.788 0.491
## Residuals 7 2580552 368650
In the ANOVA table, the p-value for City_Category is 0.491, which is greater than the significance level of 0.05. We therefore fail to reject the null hypothesis: there is not enough evidence of a difference in mean Purchase across the City_Category groups.
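Had the p-value fallen below 0.05, the post-hoc comparison mentioned above could be obtained directly from the fitted aov object; a minimal sketch:
# Pairwise comparisons of City_Category means with Tukey's HSD
TukeyHSD(anova_result)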
Regress purchase on city category
# Fit linear regression model
model1 <- lm(Purchase ~ City_Category, data = df)
# Summarize model results
summary(model1)
##
## Call:
## lm(formula = Purchase ~ City_Category, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -573.6 -461.8 -168.5 416.6 854.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1944.1 303.6 6.404 0.000366 ***
## City_CategoryB 561.4 463.7 1.211 0.265301
## City_CategoryC 101.8 463.7 0.220 0.832452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 607.2 on 7 degrees of freedom
## Multiple R-squared: 0.1837, Adjusted R-squared: -0.0495
## F-statistic: 0.7877 on 2 and 7 DF, p-value: 0.4914
The interpretation of the coefficients depends on how City_Category is coded. Here it is a categorical variable with three levels (A, B, and C), with A as the reference category: the intercept is the mean Purchase for category A, and the coefficients for B and C are the differences in mean Purchase between those categories and A.
In this case, the p-value for City_CategoryB is 0.265 and the p-value for City_CategoryC is 0.832. Both are greater than 0.05, so there is no evidence of a significant difference in mean Purchase between either category and the reference category; in this sample, City_Category is not a significant predictor of Purchase.
Generate data such that city category is a significant predictor of purchase
One way to do this is to assign a different mean Purchase to each City_Category, so that the differences between the group means are large relative to the variation within the groups.
# Set seed for reproducibility
set.seed(123)
# Generate data
df2 <- tibble(
  User_ID = 1:1000,
  Product_ID = sample(1:100, 1000, replace = TRUE),
  Gender = sample(c("M", "F"), 1000, replace = TRUE),
  Age = sample(18:60, 1000, replace = TRUE),
  Occupation = sample(1:5, 1000, replace = TRUE),
  City_Category = sample(c("aaa", "bbb", "ccc"), 1000, replace = TRUE),
  Marital_Status = sample(c(0, 1), 1000, replace = TRUE)
)
# Manipulate mean Purchase for each City_Category.
# Note: rnorm(1, ...) is drawn once per branch and recycled by case_when(),
# so every buyer in a category receives the same Purchase value.
df2 <- df2 %>%
  mutate(Purchase = case_when(
    City_Category == "aaa" ~ rnorm(1, 50, 10),
    City_Category == "bbb" ~ rnorm(1, 60, 10),
    City_Category == "ccc" ~ rnorm(1, 70, 10)
  ))
df2 %>%
  group_by(City_Category) %>%
  summarise(n = n(),
            mean_purchase = mean(Purchase),
            sd_purchase = sd(Purchase))
## # A tibble: 3 × 4
## City_Category n mean_purchase sd_purchase
## <chr> <int> <dbl> <dbl>
## 1 aaa 338 67.1 0
## 2 bbb 344 57.9 0
## 3 ccc 318 73.3 0
# Fit linear regression model
model2 <- lm(Purchase ~ City_Category, data = df2)
# Summarize model results
summary(model2)
##
## Call:
## lm(formula = Purchase ~ City_Category, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.110e-14 -5.110e-14 -3.600e-15 0.000e+00 1.481e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.710e+01 2.575e-14 2.606e+15 <2e-16 ***
## City_Categorybbb -9.245e+00 3.626e-14 -2.550e+14 <2e-16 ***
## City_Categoryccc 6.196e+00 3.698e-14 1.675e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.734e-13 on 997 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.957e+28 on 2 and 997 DF, p-value: < 2.2e-16
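Because each branch of the case_when() above receives a single recycled draw, every buyer in a category has exactly the same Purchase, which is why sd_purchase is 0 and R-squared is 1. A more realistic sketch draws per-row noise around assumed category means of 50, 60, and 70; df3 is a name introduced here:
# Give each buyer an individual draw around a category-specific mean
set.seed(123)
df3 <- df2 %>%
  mutate(Purchase = rnorm(n(),
                          mean = c(aaa = 50, bbb = 60, ccc = 70)[City_Category],
                          sd = 10))
# City_Category should still be a highly significant predictor,
# now with nonzero within-group variance
summary(lm(Purchase ~ City_Category, data = df3))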
Use logistic regression to predict purchase greater than $2000 by marital status
First, read in the data and create a binary variable indicating whether a purchase is greater than $2000.
# Read in data from a CSV file ("purchase_data.csv" is a placeholder file name)
data <- read_csv("purchase_data.csv")
# Create the binary outcome variable
data <- data %>% mutate(Purchase_Greater_Than_2000 = ifelse(Purchase > 2000, 1, 0))
# Fit the logistic regression model
logistic_model <- glm(Purchase_Greater_Than_2000 ~ Marital_Status, data = data, family = binomial(link = "logit"))
# Generate predicted probabilities for each level of Marital_Status
new_data <- data.frame(Marital_Status = c(0, 1)) # 0 represents single, 1 represents married
new_data$predicted_prob <- predict(logistic_model, newdata = new_data, type = "response")
# Print predicted probabilities
new_data
# Print model summary
summary(logistic_model)
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.0013  -0.6699  -0.6699   1.1052   1.1052
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -0.5128     0.0217 -23.603  < 2e-16 ***
## Marital_Status   0.4362     0.0263  16.591  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 550071  on 499999  degrees of freedom
## Residual deviance: 542992  on 499998  degrees of freedom
## AIC: 542996
##
## Number of Fisher Scoring iterations: 4
##   Marital_Status predicted_prob
## 1              0      0.0567677
## 2              1      0.1076199
The model summary shows that the coefficient for Marital_Status is 0.4362, indicating that the log-odds of making a purchase greater than $2000 are 0.4362 higher for a married buyer than for a single buyer.
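On the odds scale this corresponds to an odds ratio of about exp(0.4362) ≈ 1.55; a minimal sketch to obtain it from the fitted model:
# Exponentiate the coefficients to convert log-odds to odds ratios
exp(coef(logistic_model))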
The predicted probabilities table shows that the predicted probability of making a purchase greater than $2000 is 0.0568 for a single individual and 0.1076 for a married individual. Married individuals therefore have roughly twice the predicted probability of single individuals, but both probabilities are low, so marital status on its own is a limited predictor of making a purchase greater than $2000.