Mid-Test

Econometrics

Contact	: \(\downarrow\)
Email	mugemisausan05@gmail.com
Instagram	https://www.instagram.com/saram.05/
RPubs	https://rpubs.com/sausanramadhani/

1. Analyze The Relationship

Analyze the relationship between a company’s sales revenue and its advertising expenditure, product price, future value, tax, and interest rate. Follow the instructions below:

1.1 Generate Hypothetical data

Generate hypothetical data for 100* observations (replace the * sign with the last two digits of your Student ID number). My Student ID number is 20214920004, so the last two digits are 04 and the number of observations is 10004.

set.seed(211)
n <- 10004

The seed is fixed for reproducibility, and the sample size is set to 10004 observations.

1.2 Create Five Independent Variables

Create five independent variables: advertising expenditure, product price, future value, tax, and interest rate.

expenditure <- rnorm(n, mean = 1000, sd = 200)
product_price <- rnorm(n, mean = 50, sd = 10)
future_value <- rnorm(n, mean = 200, sd = 50)
tax <- rnorm(n, mean = 0.1, sd = 0.02)
interest_rate <- rnorm(n, mean = 0.05, sd = 0.01)

1.3 Generate a Dependent Variable

Generate a dependent variable, sales revenue, using a linear relationship with the independent variables.

sales_revenue <- 1000 + 5*expenditure - 2*product_price + 10*future_value - 500*tax + 2000*interest_rate + rnorm(n, mean = 0, sd = 100)
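Written out, the simulation above corresponds to the following data-generating model. The index \(i\) and the error symbol \(\varepsilon_i\) are notation introduced here for illustration; the coefficients and the error standard deviation are taken directly from the code.

```latex
\[
\begin{aligned}
\text{sales\_revenue}_i ={}& 1000 + 5\,\text{expenditure}_i - 2\,\text{product\_price}_i + 10\,\text{future\_value}_i \\
 & - 500\,\text{tax}_i + 2000\,\text{interest\_rate}_i + \varepsilon_i,
 \qquad \varepsilon_i \sim N(0,\ 100^2)
\end{aligned}
\]
```

A regression fitted to this data should therefore return coefficients close to 1000, 5, -2, 10, -500, and 2000, with a residual standard error close to 100.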

The following data frame contains the independent variables and the dependent variable:

num1 <- data.frame(expenditure, product_price, future_value, tax, interest_rate, sales_revenue)
num1

1.4 Fit a Multiple Regression Model

Fit a multiple regression model in which the dependent variable is regressed on the independent variables.

model_num1 <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = num1)

1.5 Print a Summary

Print a summary of the regression results, which includes coefficients, standard errors, t-statistics, p-value, and R-squared.

summary(model_num1)

## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = num1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -340.05  -66.67   -1.03   67.84  362.56 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.997e+02  1.089e+01   91.83   <2e-16 ***
## expenditure    4.991e+00  5.026e-03  993.02   <2e-16 ***
## product_price -1.934e+00  1.008e-01  -19.18   <2e-16 ***
## future_value   1.002e+01  2.007e-02  499.40   <2e-16 ***
## tax           -4.923e+02  5.086e+01   -9.68   <2e-16 ***
## interest_rate  2.024e+03  1.004e+02   20.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.5 on 9998 degrees of freedom
## Multiple R-squared:  0.992,  Adjusted R-squared:  0.992 
## F-statistic: 2.49e+05 on 5 and 9998 DF,  p-value: < 2.2e-16

Based on the summary above, the overall F-test has a p-value < 2.2e-16, so the regression model is jointly significant at the 5% level. The Multiple R-squared of 0.992 means the model explains about 99.2% of the variance in sales_revenue. Every predictor is individually significant (p < 2e-16), and the estimated coefficients are close to the values used in the simulation (1000, 5, -2, 10, -500, and 2000). The fit is this good by construction, because sales_revenue was generated from exactly this linear model.
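One way to make the comparison between estimated and simulated coefficients explicit is to check whether each true value falls inside its 95% confidence interval. The sketch below re-creates the simulation from Sections 1.1 to 1.3 so that it runs on its own; the names `fit`, `true_coef`, and `recovery` are introduced here for illustration.

```r
# Re-create the simulation from Sections 1.1-1.3 so the block is self-contained.
set.seed(211)
n <- 10004
expenditure   <- rnorm(n, mean = 1000, sd = 200)
product_price <- rnorm(n, mean = 50, sd = 10)
future_value  <- rnorm(n, mean = 200, sd = 50)
tax           <- rnorm(n, mean = 0.1, sd = 0.02)
interest_rate <- rnorm(n, mean = 0.05, sd = 0.01)
sales_revenue <- 1000 + 5 * expenditure - 2 * product_price +
  10 * future_value - 500 * tax + 2000 * interest_rate +
  rnorm(n, mean = 0, sd = 100)

fit <- lm(sales_revenue ~ expenditure + product_price + future_value +
            tax + interest_rate)

# Coefficients used in the simulation, in the order lm() reports them.
true_coef <- c(1000, 5, -2, 10, -500, 2000)

ci <- confint(fit, level = 0.95)   # 6 x 2 matrix of lower and upper bounds
recovery <- data.frame(
  true     = true_coef,
  estimate = unname(coef(fit)),
  lower    = ci[, 1],
  upper    = ci[, 2],
  covered  = true_coef >= ci[, 1] & true_coef <= ci[, 2]
)
print(recovery)
```

A `covered` value of `FALSE` for one coefficient would not by itself indicate a problem, since a 95% interval is expected to miss the true value about 5% of the time.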

1.6 Plot The Residuals

Plot the residuals against the fitted values to check for heteroscedasticity (unequal variance) and nonlinearity.

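plot(model_num1, which = 1, col = "#824D74")

In the residual plot above, the smoothing line is approximately flat around zero, which suggests that the relationship is linear. The residuals are spread evenly around that line across the range of fitted values, which suggests constant variance (homoscedasticity).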

1.7 Diagnostic Plots

Plot diagnostic plots to further assess the assumptions of linear regression, including normality of residuals, constant variance, and absence of influential outliers.

library(dplyr)
library(ggfortify)
# autoplot() arranges its own grid of ggplot2 panels; par(mfrow) only affects base graphics
autoplot(model_num1)

Residuals vs Fitted : The residuals show no systematic pattern and are evenly scattered around the horizontal zero line, so the linearity assumption appears to hold.

Normal Q-Q : The points follow the straight reference line closely, so the residuals appear to be normally distributed.

Scale-Location : The residuals are spread evenly across the fitted values, which is consistent with constant variance.

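plot(model_num1, 4, col = "#6D2932")

Cook's Distance : Plot 4 shows Cook's distance for each observation and labels three observations. Because plot() labels the three largest values by default, these points are candidates for influential observations rather than confirmed outliers.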
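To judge whether labelled points are actually influential, Cook's distance can be compared with a cutoff. The sketch below uses a small simulated data set (hypothetical, not the data above) and the 4/n rule of thumb, which is an assumption made here and not part of the assignment.

```r
# Small simulated example (hypothetical data, not the data set above).
set.seed(1)
x <- rnorm(200)
y <- 2 + 3 * x + rnorm(200)
fit <- lm(y ~ x)

cd <- cooks.distance(fit)       # one Cook's distance per observation
cutoff <- 4 / length(cd)        # rule-of-thumb threshold (an assumption)

flagged <- which(cd > cutoff)   # observations above the cutoff

# The three largest values are the ones plot(fit, 4) labels by default.
head(sort(cd, decreasing = TRUE), 3)
length(flagged)
```

The same two calls, `cooks.distance()` and a comparison with a chosen cutoff, could be applied to `model_num1` to see whether its three labelled observations exceed the threshold.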

2. Investigate The Factors

Investigate the factors influencing housing prices according to the following instructions:

2.1 Simulate A Hypothetical Dataset

Simulate a hypothetical dataset with 20004 observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.

library(tidyverse)

set.seed(502)
num2 <- tibble(
  house_size = rnorm(20004, mean = 1500, sd = 300),
  num_bedrooms = sample(2:5, 20004, replace = TRUE),
  city = sample(c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi"), 20004, replace = TRUE),
  toll_access = sample(c("Yes", "No"), 20004, replace = TRUE),
  age_of_house = rnorm(20004, mean = 20, sd = 10),
  price = rnorm(20004, mean = 300000, sd = 50000)
)

2.2 Fit A Multiple Regression

Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.

model_num2 <- lm(price ~ house_size + num_bedrooms + city + age_of_house, data = num2)

2.3 Convert Variable to a Factor

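Convert the “city” and “toll access” variables to factors so that they are treated as categorical variables. Because lm() already converts character columns to factors, the model fitted in Section 2.2 is not affected by doing this conversion after fitting.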

num2$city <- as.factor(num2$city)
num2$toll_access <- as.factor(num2$toll_access)
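The summary in Section 2.4 reports coefficients for Bogor, Depok, Jakarta, and Tangerang but none for Bekasi, because R sorts factor levels alphabetically and uses the first level as the reference category. The short sketch below illustrates this; choosing Jakarta as an alternative reference is a hypothetical example, not something the assignment asks for.

```r
city <- c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi")

city_f <- as.factor(city)
levels(city_f)        # levels are sorted alphabetically
levels(city_f)[1]     # the first level is the reference category

# Hypothetical choice: use Jakarta as the reference category instead.
city_jkt <- relevel(city_f, ref = "Jakarta")
levels(city_jkt)[1]
```

Each city coefficient is then interpreted as the difference in expected price relative to the reference city, holding the other predictors fixed.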

2.4 Summarize The Fitted Regression Model

Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.

summary(model_num2)

## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age_of_house, 
##     data = num2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -210204  -33759     285   33642  170659 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   297375.454   2335.560 127.325   <2e-16 ***
## house_size         1.618      1.183   1.367    0.171    
## num_bedrooms    -130.183    315.442  -0.413    0.680    
## cityBogor       -817.092   1116.151  -0.732    0.464    
## cityDepok        758.217   1113.766   0.681    0.496    
## cityJakarta     1546.492   1111.149   1.392    0.164    
## cityTangerang   1317.354   1117.415   1.179    0.238    
## age_of_house      -3.022     35.069  -0.086    0.931    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49860 on 19996 degrees of freedom
## Multiple R-squared:  0.0004114,  Adjusted R-squared:  6.144e-05 
## F-statistic: 1.176 on 7 and 19996 DF,  p-value: 0.3129

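Based on the summary above, the overall F-test has p-value = 0.3129, so the regression model is not significant because the p-value exceeds alpha = 0.05. None of the individual predictors is significant either. The Multiple R-squared is about 0.04%, so the model explains almost none of the variation in price. This is expected: price was simulated independently of all the other variables in Section 2.1, so there is no relationship for the model to find.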
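For contrast, the sketch below simulates a price that does depend on the predictors. Every coefficient, the noise level, and the smaller sample size are hypothetical values chosen only for illustration; they are not part of the assignment or of the results reported here.

```r
set.seed(502)
n <- 2000   # smaller hypothetical sample, for illustration only

house_size   <- rnorm(n, mean = 1500, sd = 300)
num_bedrooms <- sample(2:5, n, replace = TRUE)
age_of_house <- rnorm(n, mean = 20, sd = 10)

# Hypothetical coefficients: price now depends on the predictors plus noise.
price <- 50000 + 150 * house_size + 10000 * num_bedrooms -
  500 * age_of_house + rnorm(n, mean = 0, sd = 20000)

fit_dep <- lm(price ~ house_size + num_bedrooms + age_of_house)
summary(fit_dep)$r.squared   # much larger than with an independently drawn price
```

With a data-generating process of this kind, the coefficient table and R-squared carry information about the predictors, which is not possible when price is drawn on its own.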

2.5 Check for Multicollinearity

Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.

library(car)
vif(model_num2)

##                  GVIF Df GVIF^(1/(2*Df))
## house_size   1.000217  1        1.000108
## num_bedrooms 1.000295  1        1.000147
## city         1.000784  4        1.000098
## age_of_house 1.000427  1        1.000214

By using VIF, all variables show a GVIF value very close to 1, meaning that multicollinearity is not a significant problem in this model.
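For a term with one degree of freedom, the GVIF reported by car coincides with the ordinary VIF, which can be computed by hand from an auxiliary regression of one predictor on the others. The sketch below does this in base R on a small simulated data set; the variable names mirror the ones above, but the data and sample size are hypothetical.

```r
set.seed(502)
n <- 1000
house_size   <- rnorm(n, mean = 1500, sd = 300)
num_bedrooms <- sample(2:5, n, replace = TRUE)
age_of_house <- rnorm(n, mean = 20, sd = 10)

# Auxiliary regression: one predictor explained by the remaining predictors.
aux <- lm(house_size ~ num_bedrooms + age_of_house)
r2  <- summary(aux)$r.squared

# VIF_j = 1 / (1 - R_j^2); a value of 1 means no linear overlap with the others.
vif_house_size <- 1 / (1 - r2)
vif_house_size
```

Because the predictors are drawn independently of one another, the auxiliary R-squared is near zero and the VIF is near 1, which matches the pattern in the table above.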

2.6 Perform Diagnostic Tests

Perform diagnostic tests for heteroskedasticity using the Breusch-Pagan test and for linearity using the Rainbow test.

library(lmtest)

bptest(model_num2)   #heteroskedasticity using the Breusch-Pagan test

## 
##  studentized Breusch-Pagan test
## 
## data:  model_num2
## BP = 6.1171, df = 7, p-value = 0.5261

raintest(model_num2)   #linearity using the Rainbow test

## 
##  Rainbow test
## 
## data:  model_num2
## Rain = 0.97778, df1 = 10002, df2 = 9994, p-value = 0.8694

Based on the Breusch-Pagan test, BP = 6.1171 with 7 degrees of freedom and p-value = 0.5261. Because the p-value exceeds alpha = 0.05, the null hypothesis of constant variance is not rejected, so there is no evidence of heteroscedasticity in the model.
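The studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the model's regressors and multiply the resulting R-squared by the sample size. The base R sketch below illustrates the idea on a small simulated data set (hypothetical, not the housing data); it is meant to show what the test measures, not to replace bptest().

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)   # constant error variance by construction

fit <- lm(y ~ x1 + x2)

# Auxiliary regression of the squared residuals on the same regressors.
aux <- lm(resid(fit)^2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) LM statistic
bp_df   <- length(coef(aux)) - 1        # regressors, excluding the intercept
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = p_value)
```

The degrees of freedom equal the number of regressors, which is why the test on model_num2 reports df = 7: house size, number of bedrooms, four city dummies, and age.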

Based on the Rainbow test, Rain = 0.97778 with df1 = 10002 and df2 = 9994, and p-value = 0.8694. Because the p-value exceeds alpha = 0.05, there is no evidence that the linearity assumption is violated.

2.7 Create Diagnostic Plots

Create diagnostic plots to assess the model’s assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.

par(mfrow = c(2, 2))
plot(model_num2, col = "#6196A6")

Residuals vs Fitted : The residuals show no systematic pattern and are evenly scattered around the horizontal zero line, so there is no sign of nonlinearity.

Normal Q-Q : The points follow the straight reference line closely, so the residuals appear to be normally distributed.

Scale-Location : The residuals are spread evenly across the fitted values, which is consistent with constant variance.

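plot(model_num2, 4, col = "#436850")

Cook's Distance : Plot 4 shows Cook's distance for each observation and labels three observations. Because plot() labels the three largest values by default, these points are candidates for influential observations rather than confirmed outliers.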