Analyze the relationship between a company’s advertising
expenditure, its product price, future value, tax, interest rate, and its
sales revenue. Follow the instructions below:
a. Generate hypothetical data for 10009 observations.
set.seed(45) # For reproducibility
n <- 10009
expenditure <- runif(n, 1000, 5000)
product_price <- runif(n, 10, 50)
future_value <- runif(n, 10000, 20000)
tax <- runif(n, 0.1, 0.3)
interest_rate <- runif(n, 0.01, 0.05)
b. Create five independent variables: advertising expenditure, product
price, future value, tax, and interest rate.
data <- data.frame(expenditure, product_price, future_value, tax, interest_rate)
c. Generate a dependent variable, sales revenue, using a linear
relationship with the independent variables.
beta <- c(2, -0.5, 3, -1000, -500) # hypothetical coefficients
sales_revenue <- 50000 + expenditure * beta[1] + product_price * beta[2] + future_value * beta[3] + tax * beta[4] + interest_rate * beta[5] + rnorm(n)
data$sales_revenue <- sales_revenue
d. Fit a multiple regression model in which the dependent variable is
regressed on the independent variables.
model <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = data)
model
##
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value +
##     tax + interest_rate, data = data)
##
## Coefficients:
##   (Intercept)    expenditure  product_price   future_value            tax
##       50000.0            2.0           -0.5            3.0        -1000.2
## interest_rate
##        -499.3
e. Print a summary of the regression results, which includes
coefficients, standard errors, t-statistics, p-values, and
R-squared.
summary(model)
##
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value +
##     tax + interest_rate, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5459 -0.6639  0.0074  0.6695  3.5755
##
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)
## (Intercept)    5.000e+04  7.699e-02 649412.6   <2e-16 ***
## expenditure    2.000e+00  8.580e-06 233101.0   <2e-16 ***
## product_price -5.000e-01  8.558e-04   -584.3   <2e-16 ***
## future_value   3.000e+00  3.444e-06 871141.4   <2e-16 ***
## tax           -1.000e+03  1.710e-01  -5850.1   <2e-16 ***
## interest_rate -4.993e+02  8.588e-01   -581.4   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.992 on 10003 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.634e+11 on 5 and 10003 DF, p-value: < 2.2e-16
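Because the data are simulated, the estimates can be checked against the coefficients used to generate sales_revenue. The sketch below re-simulates the data with the same seed, refits the model, and lists each true value next to its estimate and 95% confidence interval. The names true_coef and comparison are introduced here for illustration only. The R-squared of 1 is expected in this setup: the noise term rnorm(n) has standard deviation 1, while the predictors move sales revenue by thousands.

```r
# Re-simulate the data and refit the model so this block runs on its own.
set.seed(45)
n <- 10009
expenditure   <- runif(n, 1000, 5000)
product_price <- runif(n, 10, 50)
future_value  <- runif(n, 10000, 20000)
tax           <- runif(n, 0.1, 0.3)
interest_rate <- runif(n, 0.01, 0.05)

# Intercept followed by the five slopes used in the simulation.
true_coef <- c(50000, 2, -0.5, 3, -1000, -500)
sales_revenue <- true_coef[1] + expenditure * true_coef[2] +
  product_price * true_coef[3] + future_value * true_coef[4] +
  tax * true_coef[5] + interest_rate * true_coef[6] + rnorm(n)
data <- data.frame(expenditure, product_price, future_value, tax,
                   interest_rate, sales_revenue)

model <- lm(sales_revenue ~ expenditure + product_price + future_value +
              tax + interest_rate, data = data)

# Tabulate true values, estimates, and 95% confidence intervals.
ci <- confint(model, level = 0.95)
comparison <- data.frame(
  true     = true_coef,
  estimate = unname(coef(model)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)
# TRUE where the interval contains the value used in the simulation.
comparison$covered <- comparison$true >= comparison$lower &
  comparison$true <= comparison$upper
print(comparison)
```

A 95% interval is expected to miss the true value about one time in twenty, so an occasional FALSE in the covered column does not by itself indicate a problem.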
f. Plot the residuals against the fitted values to check for
heteroscedasticity (unequal variance) and nonlinearity.
plot(model$fitted.values, model$residuals)
abline(h = 0, col = "green")
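A formal test can complement the visual check. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution whose degrees of freedom equal the number of predictors. The names aux, bp_stat, and bp_p are illustrative, and the data are re-simulated so the block is self-contained.

```r
# Re-simulate the data and refit the model so this block runs on its own.
set.seed(45)
n <- 10009
expenditure   <- runif(n, 1000, 5000)
product_price <- runif(n, 10, 50)
future_value  <- runif(n, 10000, 20000)
tax           <- runif(n, 0.1, 0.3)
interest_rate <- runif(n, 0.01, 0.05)
sales_revenue <- 50000 + 2 * expenditure - 0.5 * product_price +
  3 * future_value - 1000 * tax - 500 * interest_rate + rnorm(n)
data <- data.frame(expenditure, product_price, future_value, tax,
                   interest_rate, sales_revenue)
model <- lm(sales_revenue ~ expenditure + product_price + future_value +
              tax + interest_rate, data = data)

# Auxiliary regression of the squared residuals on the same predictors.
data$res_sq <- residuals(model)^2
aux <- lm(res_sq ~ expenditure + product_price + future_value +
            tax + interest_rate, data = data)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- 5  # number of predictors in the auxiliary regression
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP statistic:", bp_stat, " df:", bp_df, " p-value:", bp_p, "\n")
```

A small p-value would suggest that the residual variance depends on the predictors. Here the noise was generated with constant variance, so the test has no true heteroscedasticity to detect.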

g. Plot diagnostic plots to further assess the assumptions of linear
regression, including normality of residuals, constant variance, and
absence of influential outliers.
par(mfrow=c(2,2))
plot(model)
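The quantities behind the leverage panel can also be inspected numerically. The sketch below extracts Cook's distances and hat values with the base R functions cooks.distance() and hatvalues(), then counts observations above two common rule-of-thumb cutoffs (4/n for Cook's distance and 2p/n for leverage). These cutoffs are conventions assumed here, not thresholds stated in the exercise, and the data are re-simulated so the block is self-contained.

```r
# Re-simulate the data and refit the model so this block runs on its own.
set.seed(45)
n <- 10009
expenditure   <- runif(n, 1000, 5000)
product_price <- runif(n, 10, 50)
future_value  <- runif(n, 10000, 20000)
tax           <- runif(n, 0.1, 0.3)
interest_rate <- runif(n, 0.01, 0.05)
sales_revenue <- 50000 + 2 * expenditure - 0.5 * product_price +
  3 * future_value - 1000 * tax - 500 * interest_rate + rnorm(n)
data <- data.frame(expenditure, product_price, future_value, tax,
                   interest_rate, sales_revenue)
model <- lm(sales_revenue ~ expenditure + product_price + future_value +
              tax + interest_rate, data = data)

cd  <- cooks.distance(model)  # influence of each observation on the fit
lev <- hatvalues(model)       # leverage of each observation
p   <- length(coef(model))    # number of estimated coefficients

# Rule-of-thumb cutoffs (assumed conventions, not part of the exercise).
n_high_cook <- sum(cd > 4 / n)
n_high_lev  <- sum(lev > 2 * p / n)

cat("Largest Cook's distance:", max(cd), "\n")
cat("Observations with Cook's distance > 4/n:", n_high_cook, "\n")
cat("Observations with leverage > 2p/n:", n_high_lev, "\n")
```

With rules of thumb like these, some observations are flagged even in well-behaved data, so the counts are a prompt for inspection rather than evidence of a problem.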

Investigate the factors influencing housing prices by following the
instructions below:
a. Simulate a hypothetical dataset with 20009 observations
containing variables such as house size, number of bedrooms, city (five
cities), toll access (yes or no), age of the house, and price.
n <- 20009
house_size <- runif(n, 800, 3500) # square feet
num_bedrooms <- sample(1:5, n, replace = TRUE)
city <- sample(c("CityA", "CityB", "CityC", "CityD", "CityE"), n, replace = TRUE)
toll_access <- sample(c("yes", "no"), n, replace = TRUE)
age <- sample(1:100, n, replace = TRUE)
price <- 100000 + house_size * 200 - num_bedrooms * 10000 + ifelse(toll_access == "yes", 15000, -10000) - age * 500 + rnorm(n, 0, 10000)
data_housing <- data.frame(house_size, num_bedrooms, city, toll_access, age, price)
b. Fit a multiple regression model using the lm() function, where
the price of the house is the dependent variable, and house size, number
of bedrooms, city, and age are the independent variables.
model_housing <- lm(price ~ house_size + num_bedrooms + city + age, data = data_housing)
c. Convert the "city" and "toll access" variables to factors so that
they are treated as categorical variables.
data_housing$city <- as.factor(data_housing$city)
data_housing$toll_access <- as.factor(data_housing$toll_access)
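The effect of the factor conversion on the model is easiest to see in the design matrix. The small sketch below uses a six-element toy vector (not the simulated dataset) to show that R keeps the first level as the reference category and creates one indicator column for each remaining level, which is why the regression output reports cityCityB through cityCityE but no cityCityA. The name city_ref is illustrative. Note that lm() converts character columns to factors on its own, so fitting the model before this step gives the same estimates.

```r
# Toy vector with the five city labels used in the simulation.
city <- factor(c("CityA", "CityB", "CityC", "CityD", "CityE", "CityB"))

# Levels are sorted alphabetically; the first one is the reference category.
print(levels(city))

# Treatment (dummy) coding: an intercept plus one column per non-reference level.
X <- model.matrix(~ city)
print(colnames(X))

# relevel() changes the reference category, here to CityC.
city_ref <- relevel(city, ref = "CityC")
print(levels(city_ref))
print(colnames(model.matrix(~ city_ref)))
```

Changing the reference level changes which comparisons the coefficients express, but it does not change the fitted values of the model.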
d. Summarize the fitted regression model to analyze the
coefficients, standard errors, t-values, and p-values.
summary(model_housing)
##
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age,
##     data = data_housing)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -49042 -12610    -95  12690  48286
##
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)
## (Intercept)   1.028e+05  5.074e+02  202.616   <2e-16 ***
## house_size    2.001e+02  1.458e-01 1372.344   <2e-16 ***
## num_bedrooms -1.003e+04  8.049e+01 -124.607   <2e-16 ***
## cityCityB    -1.824e+02  3.610e+02   -0.505    0.613
## cityCityC    -1.934e+02  3.569e+02   -0.542    0.588
## cityCityD    -2.517e+02  3.576e+02   -0.704    0.481
## cityCityE    -4.229e+01  3.571e+02   -0.118    0.906
## age          -5.051e+02  3.944e+00 -128.059   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16050 on 20001 degrees of freedom
## Multiple R-squared: 0.9896, Adjusted R-squared: 0.9896
## F-statistic: 2.728e+05 on 7 and 20001 DF, p-value: < 2.2e-16
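The residual standard error of about 16050 exceeds the noise standard deviation of 10000 used in the simulation because toll access affects price but is not in the fitted model. As an extension beyond the stated instructions, the sketch below adds toll_access and compares the two fits. The seed and the name model_full are assumptions made for illustration, so the numbers will not match the output above exactly.

```r
set.seed(45)  # assumed seed; the housing simulation above sets none of its own
n <- 20009
house_size   <- runif(n, 800, 3500)  # square feet
num_bedrooms <- sample(1:5, n, replace = TRUE)
city <- factor(sample(c("CityA", "CityB", "CityC", "CityD", "CityE"),
                      n, replace = TRUE))
toll_access <- factor(sample(c("yes", "no"), n, replace = TRUE))
age <- sample(1:100, n, replace = TRUE)
price <- 100000 + house_size * 200 - num_bedrooms * 10000 +
  ifelse(toll_access == "yes", 15000, -10000) - age * 500 +
  rnorm(n, 0, 10000)
data_housing <- data.frame(house_size, num_bedrooms, city, toll_access,
                           age, price)

# Model from the exercise, without toll access.
model_housing <- lm(price ~ house_size + num_bedrooms + city + age,
                    data = data_housing)

# Extended model (illustrative): the same terms plus toll access.
model_full <- update(model_housing, . ~ . + toll_access)

# F-test for the added term, then the residual standard error of each fit.
print(anova(model_housing, model_full))
print(c(reduced = sigma(model_housing), full = sigma(model_full)))

# Estimated price difference between houses with and without toll access.
print(coef(model_full)["toll_accessyes"])
```

In the simulation, toll access adds 15000 for "yes" and subtracts 10000 for "no", so the toll_accessyes coefficient should land near 25000 and the residual standard error of the extended model near 10000.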
e. Check for multicollinearity using the Variance Inflation Factor
(VIF) to assess the correlation between independent variables.
library(car)
vif(model_housing)
##                  GVIF Df GVIF^(1/(2*Df))
## house_size   1.000306  1        1.000153
## num_bedrooms 1.000325  1        1.000163
## city         1.000622  4        1.000078
## age          1.000437  1        1.000218
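For a term with one degree of freedom, the GVIF reported by car::vif() is the ordinary VIF, 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The base R sketch below computes it by hand for the three numeric predictors. The helper vif_manual and the seed are assumptions made for illustration, so the values will be close to, but not identical to, the output above.

```r
set.seed(45)  # assumed seed; the housing simulation above sets none of its own
n <- 20009
house_size   <- runif(n, 800, 3500)
num_bedrooms <- sample(1:5, n, replace = TRUE)
city <- factor(sample(c("CityA", "CityB", "CityC", "CityD", "CityE"),
                      n, replace = TRUE))
age <- sample(1:100, n, replace = TRUE)
predictors <- data.frame(house_size, num_bedrooms, city, age)

# Hypothetical helper: VIF of one predictor from its regression on the others.
vif_manual <- function(response, data) {
  others <- setdiff(names(data), response)
  aux <- lm(reformulate(others, response = response), data = data)
  1 / (1 - summary(aux)$r.squared)
}

# Only the numeric predictors; the multi-level factor needs the generalized VIF.
vifs <- sapply(c("house_size", "num_bedrooms", "age"), vif_manual,
               data = predictors)
print(round(vifs, 6))
```

Values this close to 1 are expected here because each predictor was drawn independently of the others.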
g. Create diagnostic plots to assess the model's assumptions,
including residual plots against fitted values, Q-Q plots of residuals,
and plots of residuals against leverage.
par(mfrow=c(2,2))
plot(model_housing)
