Econometrics Midterm Exam (UTS Ekonometrika)

Jeremi Heryandi Saudi

March 18, 2024


Analyze the relationship between a company’s sales revenue and its advertising expenditure, product price, future value, tax, and interest rate. Follow the instructions below:

a. Generate hypothetical data for 10008 observations.

set.seed(10000)  # For reproducibility
n <- 10008
expenditure <- runif(n, 1000, 5000)
product_price <- runif(n, 10, 50)
future_value <- runif(n, 10000, 20000)
tax <- runif(n, 0.1, 0.3)
interest_rate <- runif(n, 0.01, 0.05)

b. Create five independent variables: advertising expenditure, product price, future value, tax, and interest rate.

data <- data.frame(expenditure, product_price, future_value, tax, interest_rate)

c. Generate a dependent variable, sales revenue, using a linear relationship with the independent variables.

beta <- c(2, -0.5, 3, -1000, -500)  # hypothetical coefficients
sales_revenue <- 50000 + expenditure * beta[1] + product_price * beta[2] +
  future_value * beta[3] + tax * beta[4] + interest_rate * beta[5] +
  rnorm(n)  # standard normal noise term
data$sales_revenue <- sales_revenue
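
The same linear relationship can be written in matrix form, y = X beta + error, which is the form the least-squares estimator works with. The sketch below is only an illustration under stated assumptions: the names `n_demo`, `X`, `beta_full`, `eps`, `y_matrix`, and `y_termwise` are hypothetical, the seed and the small sample size are chosen for the example, and the intercept 50000 is placed in front of the coefficients listed above.

```r
set.seed(1)        # hypothetical seed, used only for this sketch
n_demo <- 100      # small sample size, an assumption for illustration

# Design matrix: a column of ones for the intercept plus the five regressors
X <- cbind(
  intercept     = 1,
  expenditure   = runif(n_demo, 1000, 5000),
  product_price = runif(n_demo, 10, 50),
  future_value  = runif(n_demo, 10000, 20000),
  tax           = runif(n_demo, 0.1, 0.3),
  interest_rate = runif(n_demo, 0.01, 0.05)
)
beta_full <- c(50000, 2, -0.5, 3, -1000, -500)  # intercept, then the five slopes
eps <- rnorm(n_demo)                            # one shared noise draw

# Matrix form and term-by-term form of the same data-generating process
y_matrix <- as.vector(X %*% beta_full + eps)
y_termwise <- 50000 + X[, "expenditure"] * 2 + X[, "product_price"] * -0.5 +
  X[, "future_value"] * 3 + X[, "tax"] * -1000 +
  X[, "interest_rate"] * -500 + eps

all.equal(y_matrix, y_termwise)  # the two forms agree up to rounding
```

Both constructions reuse the same noise vector, so any difference between them could only come from floating-point rounding.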

d. Fit a multiple regression model in which the dependent variable is regressed on the independent variables.

model <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = data)
model
## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = data)
## 
## Coefficients:
##   (Intercept)    expenditure  product_price   future_value            tax  
##    50000.0203         2.0000        -0.4987         3.0000     -1000.1249  
## interest_rate  
##     -498.7500

e. Print a summary of the regression results, which includes coefficients, standard errors, t-statistics, p-values, and R-squared.

summary(model)
## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4517 -0.6685 -0.0024  0.6820  3.6018 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    5.000e+04  7.760e-02 644344.0   <2e-16 ***
## expenditure    2.000e+00  8.685e-06 230267.3   <2e-16 ***
## product_price -4.987e-01  8.675e-04   -574.8   <2e-16 ***
## future_value   3.000e+00  3.480e-06 862086.4   <2e-16 ***
## tax           -1.000e+03  1.734e-01  -5767.8   <2e-16 ***
## interest_rate -4.988e+02  8.710e-01   -572.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 10002 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.6e+11 on 5 and 10002 DF,  p-value: < 2.2e-16
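
Because the data are simulated, the estimates can be compared directly with the coefficients used to generate them. The following is a minimal sketch on a smaller, separate simulation rather than the model above; the names `demo`, `b_true`, `fit_demo`, `ci`, and `covered`, the seed, and the two-regressor design are assumptions made for the example.

```r
set.seed(1)      # hypothetical seed, used only for this sketch
n_demo <- 500    # assumption: a small sample keeps the example fast

demo <- data.frame(
  x1 = runif(n_demo, 1000, 5000),
  x2 = runif(n_demo, 10, 50)
)
b_true <- c(50000, 2, -0.5)  # intercept, slope of x1, slope of x2
demo$y <- b_true[1] + b_true[2] * demo$x1 + b_true[3] * demo$x2 + rnorm(n_demo)

fit_demo <- lm(y ~ x1 + x2, data = demo)

coef_table <- coef(summary(fit_demo))  # Estimate, Std. Error, t value, Pr(>|t|)
ci <- confint(fit_demo, level = 0.95)  # 95% confidence intervals

# Does each interval contain the value used to generate the data?
covered <- ci[, 1] <= b_true & b_true <= ci[, 2]
cbind(estimate = coef(fit_demo), ci, true = b_true, covered)
```

With a correctly specified model, each 95% interval should cover its true value in roughly 95% of repeated simulations, so an occasional miss in a single run is not in itself a sign of a problem.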

f. Plot the residuals against the fitted values to check for heteroscedasticity (unequal variance) and nonlinearity.

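# Residuals vs. fitted values: a patternless band around zero suggests
# constant variance and a correctly specified linear form
plot(fitted(model), resid(model), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "green")  # reference line at zero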
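
A visual check can be complemented by a formal statistic. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the model's regressors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The helper `koenker_bp()` and the two simulated series are hypothetical and exist only for this illustration.

```r
set.seed(1)       # hypothetical seed, used only for this sketch
n_demo <- 1000
x <- runif(n_demo, 1, 10)
y_homo   <- 5 + 2 * x + rnorm(n_demo)          # constant error variance
y_hetero <- 5 + 2 * x + rnorm(n_demo, sd = x)  # error sd grows with x

# Hypothetical helper: n * R^2 from regressing squared residuals on the regressors
koenker_bp <- function(fit) {
  e2 <- resid(fit)^2
  X <- model.matrix(fit)            # includes the intercept column
  aux <- lm.fit(X, e2)              # auxiliary regression
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df <- ncol(X) - 1                 # number of regressors excluding the intercept
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp_homo   <- koenker_bp(lm(y_homo ~ x))
bp_hetero <- koenker_bp(lm(y_hetero ~ x))
rbind(homoscedastic = bp_homo, heteroscedastic = bp_hetero)
```

A small p-value is evidence against constant variance; in the second series the error spread was built to grow with `x`, so the statistic should be large there.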

g. Produce diagnostic plots to further assess the assumptions of linear regression, including normality of residuals, constant variance, and absence of influential outliers.

par(mfrow = c(2, 2))  # 2x2 grid for the four default lm diagnostic plots
plot(model)
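
Influential observations can also be listed numerically instead of being read off the residuals-vs-leverage panel. The sketch below uses `cooks.distance()` from base R on a small simulated data set with one planted outlier. The cutoff 4/n is a common rule of thumb, not something taken from the analysis above, and all object names are hypothetical.

```r
set.seed(1)      # hypothetical seed, used only for this sketch
n_demo <- 200
x <- runif(n_demo, 0, 10)
x[1] <- 25                      # planted high-leverage point
y <- 1 + 2 * x + rnorm(n_demo)
y[1] <- y[1] + 30               # ...that also lies far from the true line

fit_demo <- lm(y ~ x)
cd <- cooks.distance(fit_demo)

threshold <- 4 / n_demo         # rule-of-thumb cutoff (assumption)
flagged <- which(cd > threshold)

flagged                         # indices of observations exceeding the cutoff
head(sort(cd, decreasing = TRUE), 3)
```

Observations above the cutoff are candidates for a closer look, not automatic deletions.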

Investigate the factors influencing housing prices as the following instructions:

a. Simulate a hypothetical dataset with 20008 observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.

n <- 20008
house_size <- runif(n, 800, 3500)  # square feet
num_bedrooms <- sample(1:5, n, replace = TRUE)
city <- sample(c("CityA", "CityB", "CityC", "CityD", "CityE"), n, replace = TRUE)
toll_access <- sample(c("yes", "no"), n, replace = TRUE)
age <- sample(1:100, n, replace = TRUE)
# Price depends on size, bedrooms, toll access, and age, plus normal noise (sd 10000)
price <- 100000 + house_size * 200 - num_bedrooms * 10000 +
  ifelse(toll_access == "yes", 15000, -10000) - age * 500 +
  rnorm(n, 0, 10000)

data_housing <- data.frame(house_size, num_bedrooms, city, toll_access, age, price)

b. Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.

model_housing <- lm(price ~ house_size + num_bedrooms + city + age, data = data_housing)

c. Convert the "city" and "toll access" variables to factors so that they are treated as categorical variables.

data_housing$city <- as.factor(data_housing$city)
data_housing$toll_access <- as.factor(data_housing$toll_access)
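
`lm()` converts character columns to factors internally, so a fit on the character version and a fit on the converted version give the same coefficients; the explicit conversion mainly gives control over the levels. The sketch below illustrates this on a small separate simulation and shows how `relevel()` changes the baseline category. The data frame `demo`, the three-city setup, and the fit names are hypothetical.

```r
set.seed(1)      # hypothetical seed, used only for this sketch
n_demo <- 300
demo <- data.frame(
  city = sample(c("CityA", "CityB", "CityC"), n_demo, replace = TRUE),
  size = runif(n_demo, 800, 3500),
  stringsAsFactors = FALSE      # keep city as character for the first fit
)
demo$price <- 100000 + 200 * demo$size + rnorm(n_demo, 0, 10000)

fit_chr <- lm(price ~ size + city, data = demo)   # city is character here

demo$city <- as.factor(demo$city)                 # explicit conversion
fit_fct <- lm(price ~ size + city, data = demo)

# Same coefficients either way
same_fit <- isTRUE(all.equal(coef(fit_chr), coef(fit_fct)))

# Changing the reference level changes which dummies appear in the model
demo$city <- relevel(demo$city, ref = "CityC")
fit_ref <- lm(price ~ size + city, data = demo)
names(coef(fit_ref))
```

Each city coefficient is a difference from the reference level, so the choice of baseline affects how the dummies are read but not the fitted values.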

d. Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.

summary(model_housing)
## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age, 
##     data = data_housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56379 -12563   -315  12704  48171 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   1.017e+05  5.060e+02  200.943   <2e-16 ***
## house_size    2.001e+02  1.449e-01 1381.121   <2e-16 ***
## num_bedrooms -9.855e+03  7.955e+01 -123.894   <2e-16 ***
## cityCityB    -2.806e+02  3.539e+02   -0.793    0.428    
## cityCityC     2.836e+02  3.554e+02    0.798    0.425    
## cityCityD    -2.220e+02  3.562e+02   -0.623    0.533    
## cityCityE     2.189e+02  3.540e+02    0.618    0.536    
## age          -4.990e+02  3.889e+00 -128.291   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15940 on 20000 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9898 
## F-statistic: 2.782e+05 on 7 and 20000 DF,  p-value: < 2.2e-16

e. Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.

library(car)
vif(model_housing)
##                  GVIF Df GVIF^(1/(2*Df))
## house_size   1.000486  1        1.000243
## num_bedrooms 1.000366  1        1.000183
## city         1.000740  4        1.000092
## age          1.000331  1        1.000166
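
For a numeric predictor, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on a small simulation that contains one nearly collinear pair. The helper `manual_vif()` and the variables `x1`, `x2`, and `x3` are hypothetical. For multi-level factors such as city, `car::vif()` reports the generalized GVIF instead, which this helper does not reproduce.

```r
set.seed(1)      # hypothetical seed, used only for this sketch
n_demo <- 500
x1 <- rnorm(n_demo)
x2 <- rnorm(n_demo)                      # unrelated to x1
x3 <- x1 + rnorm(n_demo, sd = 0.1)       # nearly collinear with x1
y <- 1 + x1 + x2 + x3 + rnorm(n_demo)
fit_demo <- lm(y ~ x1 + x2 + x3)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2) for each numeric column
manual_vif <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(fit_demo)
round(vifs, 2)
```

Values near 1, like those reported for the housing model, indicate that the predictors are close to uncorrelated, whereas the collinear pair in this sketch produces much larger values.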

f. Run diagnostic tests: the Breusch-Pagan test for heteroscedasticity and the Rainbow test for linearity.

library(lmtest)
library(gvlma)

bptest(model_housing)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_housing
## BP = 8.2055, df = 7, p-value = 0.3148
gvlma(model_housing)
## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age, 
##     data = data_housing)
## 
## Coefficients:
##  (Intercept)    house_size  num_bedrooms     cityCityB     cityCityC  
##     101679.1         200.1       -9855.2        -280.6         283.6  
##    cityCityD     cityCityE           age  
##       -222.0         218.9        -499.0  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model_housing) 
## 
##                       Value p-value                   Decision
## Global Stat        434.0789  0.0000 Assumptions NOT satisfied!
## Skewness             0.6789  0.4100    Assumptions acceptable.
## Kurtosis           432.4179  0.0000 Assumptions NOT satisfied!
## Link Function        0.5334  0.4652    Assumptions acceptable.
## Heteroscedasticity   0.4487  0.5029    Assumptions acceptable.
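
The code above assesses linearity through the "Link Function" row of `gvlma()`. The Rainbow test named in the instruction is available as `raintest()` in the `lmtest` package, and a minimal sketch of how it could be run is given below. It uses a small separate simulation rather than `model_housing`; the data frame `demo`, the two response series, and the choice to order observations by `x` are assumptions made for illustration.

```r
library(lmtest)

set.seed(1)      # hypothetical seed, used only for this sketch
n_demo <- 500
demo <- data.frame(x = runif(n_demo, 0, 10))
demo$y_lin    <- 1 + 2 * demo$x + rnorm(n_demo)                   # truly linear
demo$y_curved <- 1 + 2 * demo$x + 0.5 * demo$x^2 + rnorm(n_demo)  # quadratic term

# Rainbow test: compares the fit on the central part of the data
# (ordered by x) with the fit on the full sample
rain_lin    <- raintest(y_lin ~ x, order.by = ~ x, data = demo)
rain_curved <- raintest(y_curved ~ x, order.by = ~ x, data = demo)

rain_lin
rain_curved
```

A small p-value points to a misspecified functional form. For the housing model, a call such as `raintest(model_housing)` would be the analogous check, since `raintest()` also accepts a fitted `lm` object.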

g. Create diagnostic plots to assess the model's assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.

par(mfrow = c(2, 2))  # 2x2 grid for the four default lm diagnostic plots
plot(model_housing)
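
One possible reading of the kurtosis flag in the `gvlma()` output, offered here as an assumption and not a confirmed diagnosis, is that toll access enters the simulated price but is left out of the fitted model, so the residuals mix two shifted normal groups. The sketch below reproduces that situation on a small separate simulation and compares the residual standard error and excess kurtosis with and without the omitted variable. All names and the seed are hypothetical.

```r
set.seed(1)      # hypothetical seed, used only for this sketch
n_demo <- 2000
house_size <- runif(n_demo, 800, 3500)
toll_access <- sample(c("yes", "no"), n_demo, replace = TRUE)
price <- 100000 + house_size * 200 +
  ifelse(toll_access == "yes", 15000, -10000) +
  rnorm(n_demo, 0, 10000)

fit_without <- lm(price ~ house_size)                # toll access omitted
fit_with    <- lm(price ~ house_size + toll_access)  # toll access included

# Hypothetical helper: sample excess kurtosis (0 for a normal distribution)
excess_kurtosis <- function(e) mean(e^4) / mean(e^2)^2 - 3

sigma_without <- summary(fit_without)$sigma
sigma_with    <- summary(fit_with)$sigma
kurt_without  <- excess_kurtosis(resid(fit_without))
kurt_with     <- excess_kurtosis(resid(fit_with))

rbind(
  without_toll = c(sigma = sigma_without, excess_kurtosis = kurt_without),
  with_toll    = c(sigma = sigma_with, excess_kurtosis = kurt_with)
)
```

In this sketch the residual standard error falls back toward the noise level of 10000 once toll access is included, and the residual kurtosis moves closer to that of a normal distribution.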
