This is a question about model selection: Forces, hierachical, and
stepwise are common methods people use to select their model. We use the
function anova() and look for the lowest AIC value for the
hierachical model.
If you have more than 2, you need to use the as.factor()
for that variable.
The assumptions to test are:
Cooks distance: test for outlier or influencial points.
max(cooks.distance(model1)) gives the highest value. Cook’s
> 1 is problematic.
Standardize residuals: should be normally distributed.
(rstandard(model1) indicates that a data point is really
far away from what is predicted. Greater than 3 could be problematic if
you have a small dataset or more than 1% of the data greater than 3.
sum(rstandard(model1)>3)
Multicollinearity: independent variables are highly correlated
with each other.vif(model1)
If you have a multiple regression equation, such as:
Y = B0 + B1x1 + B2xx
Y changes by an increase of B1
In a multiple regression, we are interested in residuals.
plot(model1) function gives 4 plots.
1st plot: The residuals need to be similar across (linear relationship)
2nd Plot: Normal distribution of residuals (QQ Plot)
3rd Plot: Homogeneity of variance, the standard deviation needs to be the same across all of the values of Y.
Cooks distance: test for outlier or influencial points.
max(cooks.distance(model1)) gives the highest value. Cook’s
> 1 is problematic.
Residuals need to be independent of eachother: Durbine Watson
test dwt() We want the Durbine Watson statistic to be
approximately 2. Values less than 1 or greater than 3 are
problematic.
Standardize residuals: should be normally distributed.
(rstandard(model1) indicates that a data point is really
far away from what is predicted. Greater than 3 could be problematic if
you have a small dataset or more than 1% of the data greater than 3.
sum(rstandard(model1)>3)
Multicollinearity: independent variables are highly correlated
with each other.vif(model1)
SKIP
Below is some output from a model predicting whether someone progressed to their third birth (outcome is 1 = had a third birth; 0 = did not have a third birth) among women under age 45 who work full time who live with a partner. Marital status is a categorical variable (1=Married, 2= Cohabitating). This data is from an Australian dataset.
## function (..., list = character(), package = NULL, lib.loc = NULL,
## verbose = getOption("verbose"), envir = .GlobalEnv, overwrite = TRUE)
## {
## fileExt <- function(x) {
## db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x)
## ans <- sub(".*\\.", "", x)
## ans[db] <- sub(".*\\.([^.]+\\.)(gz|bz2|xz)$", "\\1\\2",
## x[db])
## ans
## }
## my_read_table <- function(...) {
## lcc <- Sys.getlocale("LC_COLLATE")
## on.exit(Sys.setlocale("LC_COLLATE", lcc))
## Sys.setlocale("LC_COLLATE", "C")
## read.table(...)
## }
## stopifnot(is.character(list))
## names <- c(as.character(substitute(list(...))[-1L]), list)
## if (!is.null(package)) {
## if (!is.character(package))
## stop("'package' must be a character vector or NULL")
## }
## paths <- find.package(package, lib.loc, verbose = verbose)
## if (is.null(lib.loc))
## paths <- c(path.package(package, TRUE), if (!length(package)) getwd(),
## paths)
## paths <- unique(normalizePath(paths[file.exists(paths)]))
## paths <- paths[dir.exists(file.path(paths, "data"))]
## dataExts <- tools:::.make_file_exts("data")
## if (length(names) == 0L) {
## db <- matrix(character(), nrow = 0L, ncol = 4L)
## for (path in paths) {
## entries <- NULL
## packageName <- if (file_test("-f", file.path(path,
## "DESCRIPTION")))
## basename(path)
## else "."
## if (file_test("-f", INDEX <- file.path(path, "Meta",
## "data.rds"))) {
## entries <- readRDS(INDEX)
## }
## else {
## dataDir <- file.path(path, "data")
## entries <- tools::list_files_with_type(dataDir,
## "data")
## if (length(entries)) {
## entries <- unique(tools::file_path_sans_ext(basename(entries)))
## entries <- cbind(entries, "")
## }
## }
## if (NROW(entries)) {
## if (is.matrix(entries) && ncol(entries) == 2L)
## db <- rbind(db, cbind(packageName, dirname(path),
## entries))
## else warning(gettextf("data index for package %s is invalid and will be ignored",
## sQuote(packageName)), domain = NA, call. = FALSE)
## }
## }
## colnames(db) <- c("Package", "LibPath", "Item", "Title")
## footer <- if (missing(package))
## paste0("Use ", sQuote(paste("data(package =", ".packages(all.available = TRUE))")),
## "\n", "to list the data sets in all *available* packages.")
## else NULL
## y <- list(title = "Data sets", header = NULL, results = db,
## footer = footer)
## class(y) <- "packageIQR"
## return(y)
## }
## paths <- file.path(paths, "data")
## for (name in names) {
## found <- FALSE
## for (p in paths) {
## tmp_env <- if (overwrite)
## envir
## else new.env()
## if (file_test("-f", file.path(p, "Rdata.rds"))) {
## rds <- readRDS(file.path(p, "Rdata.rds"))
## if (name %in% names(rds)) {
## found <- TRUE
## if (verbose)
## message(sprintf("name=%s:\t found in Rdata.rds",
## name), domain = NA)
## thispkg <- sub(".*/([^/]*)/data$", "\\1", p)
## thispkg <- sub("_.*$", "", thispkg)
## thispkg <- paste0("package:", thispkg)
## objs <- rds[[name]]
## lazyLoad(file.path(p, "Rdata"), envir = tmp_env,
## filter = function(x) x %in% objs)
## break
## }
## else if (verbose)
## message(sprintf("name=%s:\t NOT found in names() of Rdata.rds, i.e.,\n\t%s\n",
## name, paste(names(rds), collapse = ",")),
## domain = NA)
## }
## if (file_test("-f", file.path(p, "Rdata.zip"))) {
## warning("zipped data found for package ", sQuote(basename(dirname(p))),
## ".\nThat is defunct, so please re-install the package.",
## domain = NA)
## if (file_test("-f", fp <- file.path(p, "filelist")))
## files <- file.path(p, scan(fp, what = "", quiet = TRUE))
## else {
## warning(gettextf("file 'filelist' is missing for directory %s",
## sQuote(p)), domain = NA)
## next
## }
## }
## else {
## files <- list.files(p, full.names = TRUE)
## }
## files <- files[grep(name, files, fixed = TRUE)]
## if (length(files) > 1L) {
## o <- match(fileExt(files), dataExts, nomatch = 100L)
## paths0 <- dirname(files)
## paths0 <- factor(paths0, levels = unique(paths0))
## files <- files[order(paths0, o)]
## }
## if (length(files)) {
## for (file in files) {
## if (verbose)
## message("name=", name, ":\t file= ...", .Platform$file.sep,
## basename(file), "::\t", appendLF = FALSE,
## domain = NA)
## ext <- fileExt(file)
## if (basename(file) != paste0(name, ".", ext))
## found <- FALSE
## else {
## found <- TRUE
## zfile <- file
## zipname <- file.path(dirname(file), "Rdata.zip")
## if (file.exists(zipname)) {
## Rdatadir <- tempfile("Rdata")
## dir.create(Rdatadir, showWarnings = FALSE)
## topic <- basename(file)
## rc <- .External(C_unzip, zipname, topic,
## Rdatadir, FALSE, TRUE, FALSE, FALSE)
## if (rc == 0L)
## zfile <- file.path(Rdatadir, topic)
## }
## if (zfile != file)
## on.exit(unlink(zfile))
## switch(ext, R = , r = {
## library("utils")
## sys.source(zfile, chdir = TRUE, envir = tmp_env)
## }, RData = , rdata = , rda = load(zfile,
## envir = tmp_env), TXT = , txt = , tab = ,
## tab.gz = , tab.bz2 = , tab.xz = , txt.gz = ,
## txt.bz2 = , txt.xz = assign(name, my_read_table(zfile,
## header = TRUE, as.is = FALSE), envir = tmp_env),
## CSV = , csv = , csv.gz = , csv.bz2 = ,
## csv.xz = assign(name, my_read_table(zfile,
## header = TRUE, sep = ";", as.is = FALSE),
## envir = tmp_env), found <- FALSE)
## }
## if (found)
## break
## }
## if (verbose)
## message(if (!found)
## "*NOT* ", "found", domain = NA)
## }
## if (found)
## break
## }
## if (!found) {
## warning(gettextf("data set %s not found", sQuote(name)),
## domain = NA)
## }
## else if (!overwrite) {
## for (o in ls(envir = tmp_env, all.names = TRUE)) {
## if (exists(o, envir = envir, inherits = FALSE))
## warning(gettextf("an object named %s already exists and will not be overwritten",
## sQuote(o)))
## else assign(o, get(o, envir = tmp_env, inherits = FALSE),
## envir = envir)
## }
## rm(tmp_env)
## }
## }
## invisible(names)
## }
## <bytecode: 0x000002970e94cbe8>
## <environment: namespace:utils>
Partner_Income_log_transformed and Woman_Education because the values in the estimate column are positive.
Current_Age and two_Cohabitating because the values in the estimate column are negative.
When the p-value is < 0.05, then the predictor is significant. Partner_Income_log_transformed and Current_Age are significant.
Marital status is a categorical variable (1=Married, 2= Cohabitating). The first row, Marital_Status is empty and the second row two_ Cohabitating has values with -0.1390692 in the estimate column. This tells us that if you are cohabiting then you are less likely to have a third child compared to Married. The reference category is married. The reason there is a blank line is because it is the reference category.
Some of these may be recent analyses we have used, others might be older.
With a binary dependent variable, logistic regression is an ideal method for analysis.
Wealth and Education influence fertility rates. There are two independent variables and one dependent variable. Multiple regression with categorical variables is an appropriate type of analysis for this question.
Advertising desirability is the independent variable and product desirability is the dependent variable. This is looking at the relationship or 2 variables which is correlation / linear regression.
You will also be responsible for the material covered in data management.
# Sample of age data
df <- data.frame(Age = c(18, 22, 27, 35, 42, 45, 19, 39, 32, 25))
# Add a new column with age categories using mutate and nested ifelse
df1 <- df %>%
mutate(Age_Category = ifelse(Age < 20, "Under 20",
ifelse(Age >= 20 & Age < 30, "20-29",
ifelse(Age >= 30 & Age < 40, "30-39", "40+"))))
# Print the data frame with age categories
print(df1)
## Age Age_Category
## 1 18 Under 20
## 2 22 20-29
## 3 27 20-29
## 4 35 30-39
## 5 42 40+
## 6 45 40+
## 7 19 Under 20
## 8 39 30-39
## 9 32 30-39
## 10 25 20-29
# Add a new column with age categories using mutate and case_when
df2 <- df %>%
mutate(Age_Category = case_when(
Age < 20 ~ "Under 20",
Age >= 20 & Age < 30 ~ "20-29",
Age >= 30 & Age < 40 ~ "30-39",
Age >= 40 ~ "40+"
))
# Print the data frame with age categories
print(df2)
## Age Age_Category
## 1 18 Under 20
## 2 22 20-29
## 3 27 20-29
## 4 35 30-39
## 5 42 40+
## 6 45 40+
## 7 19 Under 20
## 8 39 30-39
## 9 32 30-39
## 10 25 20-29
The key is a column in both datasets that allows the rows to be merged by matching to the key. Often times, this is a kind of unique identifier.
library(readr)
Condom <- read_csv("Condom.csv")
head(Condom)
## # A tibble: 6 × 8
## particip safety use sex sexexp previous selfcon perceive
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 3 0 1 5 0 5 4
## 2 6 1 0 0 3 0 2 2
## 3 9 0 0 1 2 0 3 0
## 4 13 3 0 0 3 0 4 4
## 5 14 2 0 1 3 0 6 3
## 6 18 0 0 1 8 0 5 1
An HIV researcher is interested in exploring the factors that
influence condom use with a new partner. The outcome measure
was whether a condom was used (use: condom used = 1, not
used = 0). The predictor variables include sex
(male = 0, female = 1), safety (the degree to which the
person views their relationship safe from sexually transmitted disease)
and perceive (the degree to which the person perceives a
risk from unprotected sex). Data can be found in
Condom.csv.
With an outcome variable as binary and multiple continuous or ordinal variables, logistic regression is an appropriate method for analysis.
Run a model to determine whether sex,
safety and perceive predict condom
use. Please include safety and
perceive as continuous/scale variables. Comment on the
results.
# Please include `safety` and `perceive` as continuous/scale variables.
Condom$safety <- as.numeric(Condom$safety)
Condom$perceive <- as.numeric(Condom$perceive)
model1 <- glm(use ~ sex + safety + perceive, family = binomial, data=Condom)
summary(model1)
##
## Call:
## glm(formula = use ~ sex + safety + perceive, family = binomial,
## data = Condom)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3567 -0.8832 -0.3228 1.0255 2.0523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4761 0.7517 -3.294 0.000988 ***
## sex 0.3167 0.4963 0.638 0.523361
## safety -0.4641 0.2178 -2.131 0.033099 *
## perceive 0.9402 0.2230 4.217 2.48e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 105.77 on 96 degrees of freedom
## AIC: 113.77
##
## Number of Fisher Scoring iterations: 5
Perceive is a significant predictor of condom use. The p-value associated with perceive is 2.48e-05 (indicated by ***), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person perceives a risk from unprotected sex and the likelihood of using a condom.
Safety is a significant predictor of condom use. The p-value associated with safety is 0.033099 (indicated by *), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person views their relationship safe from sexually transmitted disease and the likelihood of using a condom.
Sex is not a significant predictor of condom use. The p-value associated with sex is 0.523361 (indicated by not having a *), which is greater than the significance level of 0.05. This suggests that there is not a statistically significant relationship between males and females and the likelihood of using a condom.
Determine the best model and use this model to answer the remaining questions in this problem.
To test if one model is significantly better than another, I create other models for comparison.
model2 <- glm(use ~ safety + perceive, family = binomial, data=Condom)
summary(model2)
##
## Call:
## glm(formula = use ~ safety + perceive, family = binomial, data = Condom)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4046 -0.8443 -0.3500 0.9823 2.0101
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2230 0.6267 -3.547 0.000389 ***
## safety -0.4906 0.2157 -2.275 0.022935 *
## perceive 0.9327 0.2228 4.186 2.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 106.18 on 97 degrees of freedom
## AIC: 112.18
##
## Number of Fisher Scoring iterations: 5
Perceive is a significant predictor of condom use. The p-value associated with perceive is 2.84e-05 (indicated by ***), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person perceives a risk from unprotected sex and the likelihood of using a condom.
Safety is a significant predictor of condom use. The p-value associated with safety is 0.022935 (indicated by *), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between the degree to which the person views their relationship safe from sexually transmitted disease and the likelihood of using a condom.
Then I test if this model is significantly better.
anova(model1, model2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: use ~ sex + safety + perceive
## Model 2: use ~ safety + perceive
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 96 105.77
## 2 97 106.18 -1 -0.41001 0.522
test="Chisq" tells R that that I am using a binary
outcome.
Here’s an interpretation of the output:
Resid. Df: The residual degrees of freedom for Model 1 and Model 2 are 96 and 97, respectively.
Resid. Dev: The residual deviance for Model 1 (105.77) and Model 2 (106.18) measures the goodness-of-fit of each model. Lower values indicate a better fit.
Df: The difference in degrees of freedom between the two models (96 - 97) is -1. This is because Model 2 has one less predictor variable (sex) compared to Model 1.
Deviance: The difference in residual deviance between the two models is -0.41. This value represents the reduction in residual deviance when moving from Model 1 to Model 2.
Pr(>Chi): The p-value for the test is 0.522, which is large
(greater than than 0.05). This suggests that there is a not
statistically significant difference between the two models, and Model 2
(without sex as an additional predictor variable) does not
provide a significantly better fit to the data than Model 1.
Based on the analysis, Model 2 does not provide a significantly
better fit to the data compared to Model 1, as the p-value (0.522) is
greater than the significance level of 0.05. Therefore, adding the
sex variable to the model does not lead to a significant
improvement in model fit.
In this situation, I consider using the simpler model (Model 2), as
it has fewer predictor variables while still providing a similar level
of fit to the data. Additionally, in Model 2, both safety
and perceive are significant predictors of condom
use, suggesting that these variables have a statistically
significant relationship with the likelihood of using a condom.
sexHow does biological sex influence condom use? Be sure to interpret the “direction” of the finding (which sex has a higher likelihood of use).
While sex is not statistically significant, the estimate
value of 0.3167 suggests that the biological sex “female” is more likely
to use a condom than “male”. I know this because “female” is coded at a
higher value of 1 compared to the coded value of 0 for males. The
estimate is positive which indicates the direction of the relationship.
The observed difference in condom use between males and females might be
due to chance, and therefore, I cannot make a claim about the influence
of biological sex on condom use based on this model alone.
Please check the assumptions of your model and comment on them. Hint: When you test linearity of the logit, take the natural logarithm of the variable plus 1.
The assumptions to test are:
Cooks distance: test for outlier or influencial points.
max(cooks.distance(model1)) gives the highest value. Cook’s
> 1 is problematic.
Standardize residuals: should be normally distributed.
(rstandard(model1)) indicates that a data point is really
far away from what is predicted. Greater than 3 could be problematic if
you have a small dataset or more than 1% of the data greater than 3.
sum(rstandard(model1)>3)
Multicollinearity: independent variables are highly correlated
with each other.vif(model1) Less than 2.5 is great! Real
problem if you are over 10.
max(cooks.distance(model1))
## [1] 0.1361888
n<- length(model1)
n
## [1] 30
sum(rstandard(model1)>3)/n
## [1] 0
library(car)
vif(model1)
## sex safety perceive
## 1.100105 1.807002 1.764291
The assumptions have been met for model 1.
Be sure to mention the odds ratio associated with each predictor.
I calculate the odds ratio.
exp(coef(model1))
## (Intercept) sex safety perceive
## 0.08407051 1.37260745 0.62867995 2.56047130
The value for the independent variables are the odds. A value greater than 1 means that as the predictor increases, the odds of the outcome also increases. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decreases. Near one means it is not a great predictor outcome variable.
The odds ratio for sex is 1.37260745. This means
that the odds of using a condom for females are about 1.37 times higher
than for males, keeping all other variables constant. However, the
sex variable was not statistically significant in Model 1,
so I cannot confidently claim that there is a significant relationship
between biological sex and condom use.
The odds ratio for safety is 0.62867995. This means
that for each one-unit increase in the safety score, the
odds of using a condom decrease by about 37.1% (1 - 0.62867995), keeping
all other variables constant. Since the p-value for safety is
significant in Model 1, I conclude that there is a statistically
significant relationship between the safety perception and the
likelihood of using a condom.
The odds ratio for perceive is 2.56047130. This
means that for each one-unit increase in the perceived risk
score, the odds of using a condom increase by about 156% (2.56047130 -
1), keeping all other variables constant. As the p-value for perceive is
significant in Model 1, I conclude that there is a statistically
significant relationship between the perceived risk of unprotected sex
and the likelihood of using a condom.
I calculate Pseudo-R2 or McFadden’s R2
library(pscl)
## Warning: package 'pscl' was built under R version 4.2.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model1)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.2260487
McFadden’s R2 value (0.226) indicates that the model with
the sex, safety, perceive
explains the 22.6% of the variation compared to the null model.
A logistic regression was conducted to determine factors that influence condom use with a new partner. The results reveal that perceived risk and safety perception are significant predictors of condom use (p < 0.05). For each one-unit increase in perceived risk (the degree to which the person perceives a risk from unprotected sex), the odds of using a condom increase by about 156% (Odds Ratio = 2.5605), while for each one-unit increase in safety perception the degree to which the person views their relationship safe from sexually transmitted disease), the odds of using a condom decrease by about 37.1% (Odds Ratio = 0.6287). However, biological sex (male or female) is not a significant predictor of condom use (p = 0.523361), and the odds of using a condom for females are about 1.37 times higher than for males, although this difference is not statistically significant The McFadden’s R-squared value for the model is 0.226, indicating that the model with the sex, safety perception, and perceived risk variables explains 22.6% of the variation in condom use compared to the null model. The McFadden’s R-squared value for the model is 0.226, indicating that the model with the sex, safety perception, and perceived risk variables explains 22.6% of the variation in condom use compared to the null model.
Tablets.csvTablets are very popular. A CEO was interested in how to make her
brand of tablets more desirable. She collected data on how cool people
perceived a product’s advertising to be (Advert_cool), how
cool they thought the product was (Product_cool), and how
much they wanted to buy the product (Desirability). Test
her hypothesis that cool advertising and cool product both influence a
product’s desirability. Data can be found in
Tablets.csv.
Tablets <- read_csv("Tablets.csv")
head(Tablets)
## # A tibble: 6 × 4
## ID Advert_Cool Desirability Product_Cool
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2 4 0
## 2 2 2 3 2
## 3 3 3 5 1
## 4 4 2 3 -1
## 5 5 2 4 1
## 6 6 2 3 0
sapply(Tablets, length)
## ID Advert_Cool Desirability Product_Cool
## 240 240 240 240
sapply(Tablets, class)
## ID Advert_Cool Desirability Product_Cool
## "numeric" "numeric" "numeric" "numeric"
Create scatter plots to examine the relationships between these variables.
pairs(Tablets)
To run a regression model, we assume that the relationship between the independent and dependent variables are linear.
Not really linear due to variable measurements.
Which analyses should you run to test this hypothesis?
Multiple regression
Run the appropriate statistical analysis.
# Multiple regression model
model1 <- lm(Desirability ~ Advert_Cool + Product_Cool, data = Tablets)
summary(model1)
##
## Call:
## lm(formula = Desirability ~ Advert_Cool + Product_Cool, data = Tablets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94937 -0.51533 0.05063 0.48467 1.80271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.14305 0.13756 22.849 < 2e-16 ***
## Advert_Cool 0.18614 0.05959 3.124 0.00201 **
## Product_Cool 0.24791 0.05670 4.372 1.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7053 on 237 degrees of freedom
## Multiple R-squared: 0.1297, Adjusted R-squared: 0.1223
## F-statistic: 17.65 on 2 and 237 DF, p-value: 7.127e-08
The results of the multiple regression model suggest that both
Advert_Cool and Product_Cool are significant
predictors of Desirability.
For Advert_Cool, the estimated coefficient is
0.18614, with a standard error of 0.05959, a t-value of 3.124, and a
p-value of 0.00201. Since the p-value is less than the significance
level of 0.05, we can conclude that Advert_Cool
significantly predicts Desirability.
For Product_Cool, the estimated coefficient is
0.24791, with a standard error of 0.05670, a t-value of 4.372, and a
p-value of 1.84e-05. As the p-value is also less than the significance
level of 0.05, we can conclude that Product_Cool
significantly predicts Desirability.
What are the assumptions of your model? Test them.
Assumptions: In a multiple regression, we are interested in residuals.
plot(model1) function gives 4 plots.
plot(model1)
TRUE
TRUE enough
TRUE enough
max(cooks.distance(model1)) gives the highest value. Cook’s
> 1 is problematic.max(cooks.distance(model1))
## [1] 0.08821221
dwt() I want the Durbine Watson statistic to be
approximately 2. Values less than 1 or greater than 3 are
problematic.dwt(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02263121 2.041017 0.722
## Alternative hypothesis: rho != 0
2.04 is good.
plot(rstandard(model1)) indicates that a data point is
really far away from what is predicted. Greater than 3 could be
problematic if you have a small dataset or more than 1% of the data
greater than 3. sum(rstandard(model1)>3)/n where “n” is
the length of the model.plot(rstandard(model1))
fine
vif(model1)vif(model1)
## Advert_Cool Product_Cool
## 1.037188 1.037188
Both values are less than 2.5, therefore I conclude that the predictors are not highly correlated with each other.
Cooks distance value was well below 1.
As Advert_cool increases by 1, what is the effect on
Desirability?
y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
Desirability = β₀ + β₁(Advert_Cool) + β₂(Product_Cool) + ε
As Advert_Cool increases by 1, Desirability increases by β₁.
Summarize your findings in light of the original problem.
The CEO was interested in determining whether cool advertising
(Advert_Cool) and a cool product
(Product_Cool) influence a product’s desirability. Based on
the multiple regression analysis, both Advert_Cool and
Product_Cool were found to be significant predictors of
desirability (p < 0.05). The analysis shows that for each one-unit
increase in the coolness of advertising, the desirability score
increases by 0.186. Similarly, for each one-unit increase in the
coolness of the product, the desirability score increases by 0.248. The
adjusted R-squared value is 0.1223, which means that the model explains
approximately 12.23% of the variation in desirability.