Short Answer

  1. How do you decide what is in your “best” model for logistic or multiple regression if you have several independent variables?

This is a question about model selection: Forces, hierachical, and stepwise are common methods people use to select their model. We use the function anova() and look for the lowest AIC value for the hierachical model.

  1. How do you include categorical variables with more than 2 categories in linear (or multiple) regression?

If you have more than 2, you need to use the as.factor() for that variable.

  1. What are the assumptions of logistic regression? How would you test them? What about outliers and influential cases?

The assumptions to test are:

  1. Cooks distance: test for outlier or influencial points. max(cooks.distance(model1)) gives the highest value. Cook’s > 1 is problematic.

  2. Standardize residuals: should be normally distributed. (rstandard(model1) indicates that a data point is really far away from what is predicted. Greater than 3 could be problematic if you have a small dataset or more than 1% of the data greater than 3. sum(rstandard(model1)>3)

  3. Multicollinearity: independent variables are highly correlated with each other.vif(model1)

  4. If you have a multiple regression equation, such as:

    Y = B0 + B1x1 + B2xx

Y changes by an increase of B1

  1. What are the assumptions that need to be tested in multiple regression?

In a multiple regression, we are interested in residuals. plot(model1) function gives 4 plots.

  1. 1st plot: The residuals need to be similar across (linear relationship)

  2. 2nd Plot: Normal distribution of residuals (QQ Plot)

  3. 3rd Plot: Homogeneity of variance, the standard deviation needs to be the same across all of the values of Y.

  4. Cooks distance: test for outlier or influencial points. max(cooks.distance(model1)) gives the highest value. Cook’s > 1 is problematic.

  5. Residuals need to be independent of eachother: Durbine Watson test dwt() We want the Durbine Watson statistic to be approximately 2. Values less than 1 or greater than 3 are problematic.

  6. Standardize residuals: should be normally distributed. (rstandard(model1) indicates that a data point is really far away from what is predicted. Greater than 3 could be problematic if you have a small dataset or more than 1% of the data greater than 3. sum(rstandard(model1)>3)

  7. Multicollinearity: independent variables are highly correlated with each other.vif(model1)

  8. SKIP

  9. Below is some output from a model predicting whether someone progressed to their third birth (outcome is 1 = had a third birth; 0 = did not have a third birth) among women under age 45 who work full time who live with a partner. Marital status is a categorical variable (1=Married, 2= Cohabitating). This data is from an Australian dataset.

## function (..., list = character(), package = NULL, lib.loc = NULL, 
##     verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE) 
## {
##     fileExt <- function(x) {
##         db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x)
##         ans <- sub(".*\\.", "", x)
##         ans[db] <- sub(".*\\.([^.]+\\.)(gz|bz2|xz)$", "\\1\\2", 
##             x[db])
##         ans
##     }
##     my_read_table <- function(...) {
##         lcc <- Sys.getlocale("LC_COLLATE")
##         on.exit(Sys.setlocale("LC_COLLATE", lcc))
##         Sys.setlocale("LC_COLLATE", "C")
##         read.table(...)
##     }
##     stopifnot(is.character(list))
##     names <- c(as.character(substitute(list(...))[-1L]), list)
##     if (!is.null(package)) {
##         if (!is.character(package)) 
##             stop("'package' must be a character vector or NULL")
##     }
##     paths <- find.package(package, lib.loc, verbose = verbose)
##     if (is.null(lib.loc)) 
##         paths <- c(path.package(package, TRUE), if (!length(package)) getwd(), 
##             paths)
##     paths <- unique(normalizePath(paths[file.exists(paths)]))
##     paths <- paths[dir.exists(file.path(paths, "data"))]
##     dataExts <- tools:::.make_file_exts("data")
##     if (length(names) == 0L) {
##         db <- matrix(character(), nrow = 0L, ncol = 4L)
##         for (path in paths) {
##             entries <- NULL
##             packageName <- if (file_test("-f", file.path(path, 
##                 "DESCRIPTION"))) 
##                 basename(path)
##             else "."
##             if (file_test("-f", INDEX <- file.path(path, "Meta", 
##                 "data.rds"))) {
##                 entries <- readRDS(INDEX)
##             }
##             else {
##                 dataDir <- file.path(path, "data")
##                 entries <- tools::list_files_with_type(dataDir, 
##                   "data")
##                 if (length(entries)) {
##                   entries <- unique(tools::file_path_sans_ext(basename(entries)))
##                   entries <- cbind(entries, "")
##                 }
##             }
##             if (NROW(entries)) {
##                 if (is.matrix(entries) && ncol(entries) == 2L) 
##                   db <- rbind(db, cbind(packageName, dirname(path), 
##                     entries))
##                 else warning(gettextf("data index for package %s is invalid and will be ignored", 
##                   sQuote(packageName)), domain = NA, call. = FALSE)
##             }
##         }
##         colnames(db) <- c("Package", "LibPath", "Item", "Title")
##         footer <- if (missing(package)) 
##             paste0("Use ", sQuote(paste("data(package =", ".packages(all.available = TRUE))")), 
##                 "\n", "to list the data sets in all *available* packages.")
##         else NULL
##         y <- list(title = "Data sets", header = NULL, results = db, 
##             footer = footer)
##         class(y) <- "packageIQR"
##         return(y)
##     }
##     paths <- file.path(paths, "data")
##     for (name in names) {
##         found <- FALSE
##         for (p in paths) {
##             tmp_env <- if (overwrite) 
##                 envir
##             else new.env()
##             if (file_test("-f", file.path(p, "Rdata.rds"))) {
##                 rds <- readRDS(file.path(p, "Rdata.rds"))
##                 if (name %in% names(rds)) {
##                   found <- TRUE
##                   if (verbose) 
##                     message(sprintf("name=%s:\t found in Rdata.rds", 
##                       name), domain = NA)
##                   thispkg <- sub(".*/([^/]*)/data$", "\\1", p)
##                   thispkg <- sub("_.*$", "", thispkg)
##                   thispkg <- paste0("package:", thispkg)
##                   objs <- rds[[name]]
##                   lazyLoad(file.path(p, "Rdata"), envir = tmp_env, 
##                     filter = function(x) x %in% objs)
##                   break
##                 }
##                 else if (verbose) 
##                   message(sprintf("name=%s:\t NOT found in names() of Rdata.rds, i.e.,\n\t%s\n", 
##                     name, paste(names(rds), collapse = ",")), 
##                     domain = NA)
##             }
##             if (file_test("-f", file.path(p, "Rdata.zip"))) {
##                 warning("zipped data found for package ", sQuote(basename(dirname(p))), 
##                   ".\nThat is defunct, so please re-install the package.", 
##                   domain = NA)
##                 if (file_test("-f", fp <- file.path(p, "filelist"))) 
##                   files <- file.path(p, scan(fp, what = "", quiet = TRUE))
##                 else {
##                   warning(gettextf("file 'filelist' is missing for directory %s", 
##                     sQuote(p)), domain = NA)
##                   next
##                 }
##             }
##             else {
##                 files <- list.files(p, full.names = TRUE)
##             }
##             files <- files[grep(name, files, fixed = TRUE)]
##             if (length(files) > 1L) {
##                 o <- match(fileExt(files), dataExts, nomatch = 100L)
##                 paths0 <- dirname(files)
##                 paths0 <- factor(paths0, levels = unique(paths0))
##                 files <- files[order(paths0, o)]
##             }
##             if (length(files)) {
##                 for (file in files) {
##                   if (verbose) 
##                     message("name=", name, ":\t file= ...", .Platform$file.sep, 
##                       basename(file), "::\t", appendLF = FALSE, 
##                       domain = NA)
##                   ext <- fileExt(file)
##                   if (basename(file) != paste0(name, ".", ext)) 
##                     found <- FALSE
##                   else {
##                     found <- TRUE
##                     zfile <- file
##                     zipname <- file.path(dirname(file), "Rdata.zip")
##                     if (file.exists(zipname)) {
##                       Rdatadir <- tempfile("Rdata")
##                       dir.create(Rdatadir, showWarnings = FALSE)
##                       topic <- basename(file)
##                       rc <- .External(C_unzip, zipname, topic, 
##                         Rdatadir, FALSE, TRUE, FALSE, FALSE)
##                       if (rc == 0L) 
##                         zfile <- file.path(Rdatadir, topic)
##                     }
##                     if (zfile != file) 
##                       on.exit(unlink(zfile))
##                     switch(ext, R = , r = {
##                       library("utils")
##                       sys.source(zfile, chdir = TRUE, envir = tmp_env)
##                     }, RData = , rdata = , rda = load(zfile, 
##                       envir = tmp_env), TXT = , txt = , tab = , 
##                       tab.gz = , tab.bz2 = , tab.xz = , txt.gz = , 
##                       txt.bz2 = , txt.xz = assign(name, my_read_table(zfile, 
##                         header = TRUE, as.is = FALSE), envir = tmp_env), 
##                       CSV = , csv = , csv.gz = , csv.bz2 = , 
##                       csv.xz = assign(name, my_read_table(zfile, 
##                         header = TRUE, sep = ";", as.is = FALSE), 
##                         envir = tmp_env), found <- FALSE)
##                   }
##                   if (found) 
##                     break
##                 }
##                 if (verbose) 
##                   message(if (!found) 
##                     "*NOT* ", "found", domain = NA)
##             }
##             if (found) 
##                 break
##         }
##         if (!found) {
##             warning(gettextf("data set %s not found", sQuote(name)), 
##                 domain = NA)
##         }
##         else if (!overwrite) {
##             for (o in ls(envir = tmp_env, all.names = TRUE)) {
##                 if (exists(o, envir = envir, inherits = FALSE)) 
##                   warning(gettextf("an object named %s already exists and will not be overwritten", 
##                     sQuote(o)))
##                 else assign(o, get(o, envir = tmp_env, inherits = FALSE), 
##                   envir = envir)
##             }
##             rm(tmp_env)
##         }
##     }
##     invisible(names)
## }
## <bytecode: 0x000002970e94cbe8>
## <environment: namespace:utils>
  1. How do we interpret the estimate in a logistic regression model?

Partner_Income_log_transformed and Woman_Education because the values in the estimate column are positive.

Current_Age and two_Cohabitating because the values in the estimate column are negative.

  1. Which predictors are significant predictors of progressing to third birth?

When the p-value is < 0.05, then the predictor is significant. Partner_Income_log_transformed and Current_Age are significant.

  1. How do you interpret the categorical variable Marital Status?

Marital status is a categorical variable (1=Married, 2= Cohabitating). The first row, Marital_Status is empty and the second row two_ Cohabitating has values with -0.1390692 in the estimate column. This tells us that if you are cohabiting then you are less likely to have a third child compared to Married. The reference category is married. The reason there is a blank line is because it is the reference category.

What type of analysis?

Some of these may be recent analyses we have used, others might be older.

  1. An HIV researcher is interested in exploring the factors that influence condom use with a new partner. The outcome measure was whether a condom was used (Use: condom used = 1, not used = 0). The predictor variables include biological sex (male = 0, female = 1), safety (the degree to which the person views their relationship safe from sexually transmitted disease and perceive (the degree to which the person perceives a risk from unprotected sex.

With a binary dependent variable, logistic regression is an ideal method for analysis.

  1. You want to understand fertility rates (measured as live births) by examining the influence of both education and wealth.

Wealth and Education influence fertility rates. There are two independent variables and one dependent variable. Multiple regression with categorical variables is an appropriate type of analysis for this question.

  1. Tablets are very popular. A CEO was interested in how to make her brand of tablets more desirable. She collected data on how desirable people perceived a product’s advertising to be (Advert_des) and how much they wanted to buy the product (Product_desirability). Test her hypothesis that product desirability is influenced by advertising desirability.

Advertising desirability is the independent variable and product desirability is the dependent variable. This is looking at the relationship or 2 variables which is correlation / linear regression.

Data Management

You will also be responsible for the material covered in data management.

  1. If you wanted to change a variable from continuous to categorical (for instance to plot it), how might you do this? For example, if you wanted to take age and convert it to: age category (under 20, 20-29, 30-39, 40+)
# Sample of age data
df <- data.frame(Age = c(18, 22, 27, 35, 42, 45, 19, 39, 32, 25))

# Add a new column with age categories using mutate and nested ifelse
df1 <- df %>%
  mutate(Age_Category = ifelse(Age < 20, "Under 20",
                                ifelse(Age >= 20 & Age < 30, "20-29",
                                       ifelse(Age >= 30 & Age < 40, "30-39", "40+"))))

# Print the data frame with age categories
print(df1)
##    Age Age_Category
## 1   18     Under 20
## 2   22        20-29
## 3   27        20-29
## 4   35        30-39
## 5   42          40+
## 6   45          40+
## 7   19     Under 20
## 8   39        30-39
## 9   32        30-39
## 10  25        20-29
# Add a new column with age categories using mutate and case_when
df2 <- df %>%
  mutate(Age_Category = case_when(
    Age < 20 ~ "Under 20",
    Age >= 20 & Age < 30 ~ "20-29",
    Age >= 30 & Age < 40 ~ "30-39",
    Age >= 40 ~ "40+"
  ))

# Print the data frame with age categories
print(df2)
##    Age Age_Category
## 1   18     Under 20
## 2   22        20-29
## 3   27        20-29
## 4   35        30-39
## 5   42          40+
## 6   45          40+
## 7   19     Under 20
## 8   39        30-39
## 9   32        30-39
## 10  25        20-29
  1. What is a “key” variable when you try to merge datasets?

The key is a column in both datasets that allows the rows to be merged by matching to the key. Often times, this is a kind of unique identifier.

Example Problems

library(readr)
Condom <- read_csv("Condom.csv")
head(Condom)
## # A tibble: 6 × 8
##   particip safety   use   sex sexexp previous selfcon perceive
##      <dbl>  <dbl> <dbl> <dbl>  <dbl>    <dbl>   <dbl>    <dbl>
## 1        5      3     0     1      5        0       5        4
## 2        6      1     0     0      3        0       2        2
## 3        9      0     0     1      2        0       3        0
## 4       13      3     0     0      3        0       4        4
## 5       14      2     0     1      3        0       6        3
## 6       18      0     0     1      8        0       5        1

1. Condom.csv

An HIV researcher is interested in exploring the factors that influence condom use with a new partner. The outcome measure was whether a condom was used (use: condom used = 1, not used = 0). The predictor variables include sex (male = 0, female = 1), safety (the degree to which the person views their relationship safe from sexually transmitted disease) and perceive (the degree to which the person perceives a risk from unprotected sex). Data can be found in Condom.csv.

With an outcome variable as binary and multiple continuous or ordinal variables, logistic regression is an appropriate method for analysis.

a. Run a model

Run a model to determine whether sex, safety and perceive predict condom use. Please include safety and perceive as continuous/scale variables. Comment on the results.

# Please include `safety` and `perceive` as continuous/scale variables.
Condom$safety <- as.numeric(Condom$safety)
Condom$perceive <- as.numeric(Condom$perceive)
model1 <- glm(use ~ sex + safety + perceive, family = binomial, data=Condom)
summary(model1)
## 
## Call:
## glm(formula = use ~ sex + safety + perceive, family = binomial, 
##     data = Condom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3567  -0.8832  -0.3228   1.0255   2.0523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4761     0.7517  -3.294 0.000988 ***
## sex           0.3167     0.4963   0.638 0.523361    
## safety       -0.4641     0.2178  -2.131 0.033099 *  
## perceive      0.9402     0.2230   4.217 2.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 105.77  on 96  degrees of freedom
## AIC: 113.77
## 
## Number of Fisher Scoring iterations: 5

Perceive is a significant predictor of condom use. The p-value associated with perceive is 2.48e-05 (indicated by ***), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person perceives a risk from unprotected sex and the likelihood of using a condom.

Safety is a significant predictor of condom use. The p-value associated with safety is 0.033099 (indicated by *), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person views their relationship safe from sexually transmitted disease and the likelihood of using a condom.

Sex is not a significant predictor of condom use. The p-value associated with sex is 0.523361 (indicated by not having a *), which is greater than the significance level of 0.05. This suggests that there is not a statistically significant relationship between males and females and the likelihood of using a condom.

b. Determine the best model

Determine the best model and use this model to answer the remaining questions in this problem.

To test if one model is significantly better than another, I create other models for comparison.

model2 <- glm(use ~ safety + perceive, family = binomial, data=Condom)
summary(model2)
## 
## Call:
## glm(formula = use ~ safety + perceive, family = binomial, data = Condom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4046  -0.8443  -0.3500   0.9823   2.0101  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2230     0.6267  -3.547 0.000389 ***
## safety       -0.4906     0.2157  -2.275 0.022935 *  
## perceive      0.9327     0.2228   4.186 2.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 106.18  on 97  degrees of freedom
## AIC: 112.18
## 
## Number of Fisher Scoring iterations: 5

Perceive is a significant predictor of condom use. The p-value associated with perceive is 2.84e-05 (indicated by ***), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person perceives a risk from unprotected sex and the likelihood of using a condom.

Safety is a significant predictor of condom use. The p-value associated with safety is 0.022935 (indicated by *), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person views their relationship safe from sexually transmitted disease and the likelihood of using a condom.

Then I test if this model is significantly better.

anova(model1, model2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: use ~ sex + safety + perceive
## Model 2: use ~ safety + perceive
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        96     105.77                     
## 2        97     106.18 -1 -0.41001    0.522

test="Chisq" tells R that that I am using a binary outcome.

Here’s an interpretation of the output:

Resid. Df: The residual degrees of freedom for Model 1 and Model 2 are 96 and 97, respectively.

Resid. Dev: The residual deviance for Model 1 (105.77) and Model 2 (106.18) measures the goodness-of-fit of each model. Lower values indicate a better fit.

Df: The difference in degrees of freedom between the two models (96 - 97) is -1. This is because Model 2 has one less predictor variable (sex) compared to Model 1.

Deviance: The difference in residual deviance between the two models is -0.41. This value represents the reduction in residual deviance when moving from Model 1 to Model 2.

Pr(>Chi): The p-value for the test is 0.522, which is large (greater than than 0.05). This suggests that there is a not statistically significant difference between the two models, and Model 2 (without sex as an additional predictor variable) does not provide a significantly better fit to the data than Model 1.

Based on the analysis, Model 2 does not provide a significantly better fit to the data compared to Model 1, as the p-value (0.522) is greater than the significance level of 0.05. Therefore, adding the sex variable to the model does not lead to a significant improvement in model fit.

In this situation, I consider using the simpler model (Model 2), as it has fewer predictor variables while still providing a similar level of fit to the data. Additionally, in Model 2, both safety and perceive are significant predictors of condom use, suggesting that these variables have a statistically significant relationship with the likelihood of using a condom.

c. Interpret sex

How does biological sex influence condom use? Be sure to interpret the “direction” of the finding (which sex has a higher likelihood of use).

While sex is not statistically significant, the estimate value of 0.3167 suggests that the biological sex “female” is more likely to use a condom than “male”. I know this because “female” is coded at a higher value of 1 compared to the coded value of 0 for males. The estimate is positive which indicates the direction of the relationship. The observed difference in condom use between males and females might be due to chance, and therefore, I cannot make a claim about the influence of biological sex on condom use based on this model alone.

d. Please check the assumptions

Please check the assumptions of your model and comment on them. Hint: When you test linearity of the logit, take the natural logarithm of the variable plus 1.

The assumptions to test are:

  1. Cooks distance: test for outlier or influencial points. max(cooks.distance(model1)) gives the highest value. Cook’s > 1 is problematic.

  2. Standardize residuals: should be normally distributed. (rstandard(model1)) indicates that a data point is really far away from what is predicted. Greater than 3 could be problematic if you have a small dataset or more than 1% of the data greater than 3. sum(rstandard(model1)>3)

  3. Multicollinearity: independent variables are highly correlated with each other.vif(model1) Less than 2.5 is great! Real problem if you are over 10.

max(cooks.distance(model1))
## [1] 0.1361888
n<- length(model1)
n
## [1] 30
sum(rstandard(model1)>3)/n
## [1] 0
library(car)
vif(model1)
##      sex   safety perceive 
## 1.100105 1.807002 1.764291

The assumptions have been met for model 1.

e. Interpret the results of your model.

Be sure to mention the odds ratio associated with each predictor.

I calculate the odds ratio.

exp(coef(model1))
## (Intercept)         sex      safety    perceive 
##  0.08407051  1.37260745  0.62867995  2.56047130

The value for the independent variables are the odds. A value greater than 1 means that as the predictor increases, the odds of the outcome also increases. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decreases. Near one means it is not a great predictor outcome variable.

  • The odds ratio for sex is 1.37260745. This means that the odds of using a condom for females are about 1.37 times higher than for males, keeping all other variables constant. However, the sex variable was not statistically significant in Model 1, so I cannot confidently claim that there is a significant relationship between biological sex and condom use.

  • The odds ratio for safety is 0.62867995. This means that for each one-unit increase in the safety score, the odds of using a condom decrease by about 37.1% (1 - 0.62867995), keeping all other variables constant. Since the p-value for safety is significant in Model 1, I conclude that there is a statistically significant relationship between the safety perception and the likelihood of using a condom.

  • The odds ratio for perceive is 2.56047130. This means that for each one-unit increase in the perceived risk score, the odds of using a condom increase by about 156% (2.56047130 - 1), keeping all other variables constant. As the p-value for perceive is significant in Model 1, I conclude that there is a statistically significant relationship between the perceived risk of unprotected sex and the likelihood of using a condom.

I calculate Pseudo-R2 or McFadden’s R2

library(pscl)
## Warning: package 'pscl' was built under R version 4.2.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model1)["McFadden"]
## fitting null model for pseudo-r2
##  McFadden 
## 0.2260487

McFadden’s R2 value (0.226) indicates that the model with the sex, safety, perceive explains the 22.6% of the variation compared to the null model.

Stat Communication Block

A logistic regression was conducted to determine factors that influence condom use with a new partner. The results reveal that perceived risk and safety perception are significant predictors of condom use (p < 0.05). For each one-unit increase in perceived risk (the degree to which the person perceives a risk from unprotected sex), the odds of using a condom increase by about 156% (Odds Ratio = 2.5605), while for each one-unit increase in safety perception the degree to which the person views their relationship safe from sexually transmitted disease), the odds of using a condom decrease by about 37.1% (Odds Ratio = 0.6287). However, biological sex (male or female) is not a significant predictor of condom use (p = 0.523361), and the odds of using a condom for females are about 1.37 times higher than for males, although this difference is not statistically significant The McFadden’s R-squared value for the model is 0.226, indicating that the model with the sex, safety perception, and perceived risk variables explains 22.6% of the variation in condom use compared to the null model. The McFadden’s R-squared value for the model is 0.226, indicating that the model with the sex, safety perception, and perceived risk variables explains 22.6% of the variation in condom use compared to the null model.

2. Tablets.csv

Tablets are very popular. A CEO was interested in how to make her brand of tablets more desirable. She collected data on how cool people perceived a product’s advertising to be (Advert_cool), how cool they thought the product was (Product_cool), and how much they wanted to buy the product (Desirability). Test her hypothesis that cool advertising and cool product both influence a product’s desirability. Data can be found in Tablets.csv.

Tablets <- read_csv("Tablets.csv")
head(Tablets)
## # A tibble: 6 × 4
##      ID Advert_Cool Desirability Product_Cool
##   <dbl>       <dbl>        <dbl>        <dbl>
## 1     1           2            4            0
## 2     2           2            3            2
## 3     3           3            5            1
## 4     4           2            3           -1
## 5     5           2            4            1
## 6     6           2            3            0
sapply(Tablets, length)
##           ID  Advert_Cool Desirability Product_Cool 
##          240          240          240          240
sapply(Tablets, class)
##           ID  Advert_Cool Desirability Product_Cool 
##    "numeric"    "numeric"    "numeric"    "numeric"

a. Create scatter plots

Create scatter plots to examine the relationships between these variables.

pairs(Tablets)

To run a regression model, we assume that the relationship between the independent and dependent variables are linear.

Not really linear due to variable measurements.

b. Which analyses

Which analyses should you run to test this hypothesis?

Multiple regression

c. Run analysis

Run the appropriate statistical analysis.

# Multiple regression model
model1 <- lm(Desirability ~ Advert_Cool + Product_Cool, data = Tablets)
summary(model1)
## 
## Call:
## lm(formula = Desirability ~ Advert_Cool + Product_Cool, data = Tablets)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.94937 -0.51533  0.05063  0.48467  1.80271 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.14305    0.13756  22.849  < 2e-16 ***
## Advert_Cool   0.18614    0.05959   3.124  0.00201 ** 
## Product_Cool  0.24791    0.05670   4.372 1.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7053 on 237 degrees of freedom
## Multiple R-squared:  0.1297, Adjusted R-squared:  0.1223 
## F-statistic: 17.65 on 2 and 237 DF,  p-value: 7.127e-08
Which (if any) variables significantly predict Desirability?

The results of the multiple regression model suggest that both Advert_Cool and Product_Cool are significant predictors of Desirability.

  • For Advert_Cool, the estimated coefficient is 0.18614, with a standard error of 0.05959, a t-value of 3.124, and a p-value of 0.00201. Since the p-value is less than the significance level of 0.05, we can conclude that Advert_Cool significantly predicts Desirability.

  • For Product_Cool, the estimated coefficient is 0.24791, with a standard error of 0.05670, a t-value of 4.372, and a p-value of 1.84e-05. As the p-value is also less than the significance level of 0.05, we can conclude that Product_Cool significantly predicts Desirability.

d. Assumptions

What are the assumptions of your model? Test them.

Assumptions: In a multiple regression, we are interested in residuals.

plot(model1) function gives 4 plots.

plot(model1)

  1. 1st plot: The residuals need to be similar across (linear relationship)

TRUE

  1. 2nd Plot: Normal distribution of residuals (QQ Plot)

TRUE enough

  1. 3rd Plot: Homogeneity of variance, the standard deviation needs to be the same across all of the values of Y.

TRUE enough

  1. Cooks distance: test for outlier or influencial points. max(cooks.distance(model1)) gives the highest value. Cook’s > 1 is problematic.
max(cooks.distance(model1))
## [1] 0.08821221
  1. Residuals need to be independent of eachother: Durbine Watson test dwt() I want the Durbine Watson statistic to be approximately 2. Values less than 1 or greater than 3 are problematic.
dwt(model1)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02263121      2.041017   0.722
##  Alternative hypothesis: rho != 0

2.04 is good.

  1. Standardize residuals: should be normally distributed. plot(rstandard(model1)) indicates that a data point is really far away from what is predicted. Greater than 3 could be problematic if you have a small dataset or more than 1% of the data greater than 3. sum(rstandard(model1)>3)/n where “n” is the length of the model.
plot(rstandard(model1))

fine

  1. Multicollinearity: independent variables are highly correlated with each other.vif(model1)
vif(model1)
##  Advert_Cool Product_Cool 
##     1.037188     1.037188

Both values are less than 2.5, therefore I conclude that the predictors are not highly correlated with each other.

Are there any particularly influential points?

Cooks distance value was well below 1.

e. Effect on Desirability

As Advert_cool increases by 1, what is the effect on Desirability?

y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Desirability = β₀ + β₁(Advert_Cool) + β₂(Product_Cool) + ε

As Advert_Cool increases by 1, Desirability increases by β₁.

f. Summarize

Summarize your findings in light of the original problem.

The CEO was interested in determining whether cool advertising (Advert_Cool) and a cool product (Product_Cool) influence a product’s desirability. Based on the multiple regression analysis, both Advert_Cool and Product_Cool were found to be significant predictors of desirability (p < 0.05). The analysis shows that for each one-unit increase in the coolness of advertising, the desirability score increases by 0.186. Similarly, for each one-unit increase in the coolness of the product, the desirability score increases by 0.248. The adjusted R-squared value is 0.1223, which means that the model explains approximately 12.23% of the variation in desirability.