Mid-Test
Ekonometrika
| Contact | |
| Email | mugemisausan05@gmail.com |
| Instagram | https://www.instagram.com/saram.05/ |
| RPubs | https://rpubs.com/sausanramadhani/ |
1. Analyze The Relationship
Analyze the relationship between a company’s sales revenue and its advertising expenditure, product price, future value, tax, and interest rate. Follow the instructions below:
1.1 Generate Hypothetical data
Generate hypothetical data for 100* observations (replace the * sign
with the last two digits of your Student ID number). My Student ID
number is 20214920004, so the last two digits are 04.
Hypothetical data was therefore generated for 10004 observations.
1.2 Create Five Independent Variables
Create five independent variables: advertising expenditure, product
price, future value, tax, and interest rate.
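The code that created these variables is not shown in this report. A minimal sketch of how the five variables could be simulated follows. The seed, distributions, means, and standard deviations are all assumptions chosen for illustration, not the values behind the results reported below.

```r
# Hypothetical sketch: every distribution and parameter here is an assumption.
set.seed(123)   # assumed seed, only to make the sketch reproducible
n <- 10004      # number of observations from section 1.1

expenditure   <- rnorm(n, mean = 1000, sd = 200)   # advertising expenditure
product_price <- rnorm(n, mean = 50,   sd = 10)    # product price
future_value  <- rnorm(n, mean = 500,  sd = 50)    # future value
tax           <- rnorm(n, mean = 0.2,  sd = 0.02)  # tax rate
interest_rate <- rnorm(n, mean = 0.05, sd = 0.01)  # interest rate

# Quick look at the simulated predictors
summary(cbind(expenditure, product_price, future_value, tax, interest_rate))
```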
1.3 Generate a Dependent Variable
Generate a dependent variable, sales revenue, using a linear
relationship with the independent variables.
sales_revenue <- 1000 + 5*expenditure - 2*product_price + 10*future_value - 500*tax + 2000*interest_rate + rnorm(10004, mean = 0, sd = 100)
The following is a data frame containing the independent variables and the dependent variable:
num1 <- data.frame(expenditure, product_price, future_value, tax, interest_rate, sales_revenue)
num1
1.4 Fit a Multiple Regression Model
Fit a multiple regression model in which the dependent variable is
regressed on the independent variables.
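The fitting step itself is not shown in this report, so the sketch below illustrates it with lm(), using the same formula that appears in the Call of the summary output. The predictor distributions, the seed, and the object name model_num1 are assumptions. Only the sales_revenue equation is taken from section 1.3.

```r
# Hypothetical sketch: predictor distributions, seed, and model name are assumed.
set.seed(123)
n <- 10004
expenditure   <- rnorm(n, mean = 1000, sd = 200)
product_price <- rnorm(n, mean = 50,   sd = 10)
future_value  <- rnorm(n, mean = 500,  sd = 50)
tax           <- rnorm(n, mean = 0.2,  sd = 0.02)
interest_rate <- rnorm(n, mean = 0.05, sd = 0.01)

# Dependent variable, using the linear relationship from section 1.3
sales_revenue <- 1000 + 5*expenditure - 2*product_price + 10*future_value -
  500*tax + 2000*interest_rate + rnorm(n, mean = 0, sd = 100)

num1 <- data.frame(expenditure, product_price, future_value,
                   tax, interest_rate, sales_revenue)

# Regress sales revenue on the five predictors
model_num1 <- lm(sales_revenue ~ expenditure + product_price + future_value +
                   tax + interest_rate, data = num1)

coef(model_num1)      # estimated intercept and slopes
summary(model_num1)   # standard errors, t-values, p-values, R-squared
```

Because the data are simulated from a known equation, the estimated slopes can be compared directly with the true coefficients used to generate sales_revenue.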
1.5 Print a Summary
Print a summary of the regression results, which includes
coefficients, standard errors, t-statistics, p-values, and R-squared.
##
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value +
## tax + interest_rate, data = num1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -340.05 -66.67 -1.03 67.84 362.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.997e+02 1.089e+01 91.83 <2e-16 ***
## expenditure 4.991e+00 5.026e-03 993.02 <2e-16 ***
## product_price -1.934e+00 1.008e-01 -19.18 <2e-16 ***
## future_value 1.002e+01 2.007e-02 499.40 <2e-16 ***
## tax -4.923e+02 5.086e+01 -9.68 <2e-16 ***
## interest_rate 2.024e+03 1.004e+02 20.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.5 on 9998 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.992
## F-statistic: 2.49e+05 on 5 and 9998 DF, p-value: < 2.2e-16
Based on the summary output above, the F-test p-value is < 2.2e-16,
which means the regression model as a whole is statistically significant
and can be used. A Multiple R-squared of 0.992 means the model explains
about 99.2% of the variance in sales_revenue, so the fit is very good.
All predictor variables are individually significant (p < 2e-16), and
the estimated coefficients are close to the values used to generate the
data (for example, 4.991 for expenditure against a true value of 5).
1.6 Plot The Residuals
Plot the residuals against the fitted values to check for
heteroscedasticity (unequal variance) and nonlinearity.
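The residual plot shows the residuals scattered around a flat horizontal line at zero with no visible curvature, so the linearity assumption appears to hold. The spread of the residuals is roughly constant across the fitted values, which suggests homoscedasticity (constant variance).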
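A residuals-versus-fitted plot of this kind could be drawn as in the sketch below. The predictor distributions, the seed, and the name model_num1 are assumptions. The lowess smoother is an optional aid for spotting curvature and is not necessarily part of the original plot.

```r
# Hypothetical sketch: predictor distributions, seed, and model name are assumed.
set.seed(123)
n <- 10004
num1 <- data.frame(
  expenditure   = rnorm(n, mean = 1000, sd = 200),
  product_price = rnorm(n, mean = 50,   sd = 10),
  future_value  = rnorm(n, mean = 500,  sd = 50),
  tax           = rnorm(n, mean = 0.2,  sd = 0.02),
  interest_rate = rnorm(n, mean = 0.05, sd = 0.01)
)
num1$sales_revenue <- with(num1, 1000 + 5*expenditure - 2*product_price +
  10*future_value - 500*tax + 2000*interest_rate) + rnorm(n, mean = 0, sd = 100)

model_num1 <- lm(sales_revenue ~ expenditure + product_price + future_value +
                   tax + interest_rate, data = num1)

fitted_vals <- fitted(model_num1)
residuals_1 <- resid(model_num1)

# Residuals against fitted values: look for funnels (unequal variance)
# or curves (nonlinearity)
plot(fitted_vals, residuals_1,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted", pch = 20, col = "grey40")
abline(h = 0, lty = 2, col = "red")                      # zero reference line
lines(lowess(fitted_vals, residuals_1), col = "blue")    # smoothed trend
```

A smoothed trend that stays close to the zero line, with a band of points of roughly equal width, is what supports the linearity and constant-variance reading.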
1.7 Diagnostic Plots
Plot diagnostic plots to further assess the assumptions of linear regression, including normality of residuals, constant variance, and absence of influential outliers.
Residuals vs Fitted : The residuals are evenly scattered around the horizontal line with no systematic pattern, so the linearity assumption is met.
Normal Q-Q : The points follow the straight reference line closely, so the residuals are approximately normally distributed.
Scale-Location : The Scale-Location plot shows that the residuals are evenly spread across the fitted values, which indicates constant variance.
Residuals vs Leverage : In the Residuals vs Leverage plot, which uses Cook’s distance, 3 observations are flagged as potential outliers.
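The four diagnostic plots can be produced with plot() on a fitted lm object, as sketched below. The predictor distributions, the seed, and the name model_num1 are assumptions. Note that plot() labels the three most extreme observations by default (id.n = 3), so three labelled points do not by themselves show that those points are influential. The 4/n cutoff for Cook’s distance used here is a common rule of thumb, introduced only for illustration.

```r
# Hypothetical sketch: predictor distributions, seed, and model name are assumed.
set.seed(123)
n <- 10004
num1 <- data.frame(
  expenditure   = rnorm(n, mean = 1000, sd = 200),
  product_price = rnorm(n, mean = 50,   sd = 10),
  future_value  = rnorm(n, mean = 500,  sd = 50),
  tax           = rnorm(n, mean = 0.2,  sd = 0.02),
  interest_rate = rnorm(n, mean = 0.05, sd = 0.01)
)
num1$sales_revenue <- with(num1, 1000 + 5*expenditure - 2*product_price +
  10*future_value - 500*tax + 2000*interest_rate) + rnorm(n, mean = 0, sd = 100)

model_num1 <- lm(sales_revenue ~ expenditure + product_price + future_value +
                   tax + interest_rate, data = num1)

# Four standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model_num1)
par(mfrow = c(1, 1))

# Numeric check of influence with Cook's distance
cooks <- cooks.distance(model_num1)
threshold <- 4 / nrow(num1)          # rule-of-thumb cutoff (an assumption)
n_flagged <- sum(cooks > threshold)  # observations above the cutoff
n_flagged
max(cooks)                           # largest Cook's distance in the sample
```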
2. Investigate The Factors
Investigate the factors influencing housing prices according to the following instructions:
2.1 Simulate A Hypothetical Dataset
Simulate a hypothetical dataset with 20004 observations containing
variables such as house size, number of bedrooms, city (five cities),
toll access (yes or no), age of the house, and price.
library(tidyverse)
set.seed(502)
num2 <- tibble(
  house_size = rnorm(20004, mean = 1500, sd = 300),
  num_bedrooms = sample(2:5, 20004, replace = TRUE),
  city = sample(c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi"), 20004, replace = TRUE),
  toll_access = sample(c("Yes", "No"), 20004, replace = TRUE),
  age_of_house = rnorm(20004, mean = 20, sd = 10),
  price = rnorm(20004, mean = 300000, sd = 50000)
)
2.2 Fit A Multiple Regression
Fit a multiple regression model using the lm() function,
where the price of the house is the dependent variable, and house size,
number of bedrooms, city, and age are the independent variables.
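The fitting code is not shown in this report. The sketch below uses the formula from the Call of the summary output and the object name model_num2 that appears in the diagnostic test output. It rebuilds the simulated data with a base R data.frame instead of a tibble (an assumption made to avoid extra packages), so the random draws are not guaranteed to match the reported results exactly.

```r
# Sketch: same simulation settings as section 2.1, with a base R data frame
set.seed(502)
n <- 20004
num2 <- data.frame(
  house_size   = rnorm(n, mean = 1500, sd = 300),
  num_bedrooms = sample(2:5, n, replace = TRUE),
  city         = sample(c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi"),
                        n, replace = TRUE),
  toll_access  = sample(c("Yes", "No"), n, replace = TRUE),
  age_of_house = rnorm(n, mean = 20, sd = 10),
  price        = rnorm(n, mean = 300000, sd = 50000),
  stringsAsFactors = FALSE
)

# Price regressed on house size, bedrooms, city, and age of the house.
# lm() treats the character column city as categorical and builds dummies.
model_num2 <- lm(price ~ house_size + num_bedrooms + city + age_of_house,
                 data = num2)

names(coef(model_num2))   # Bekasi is the reference level (first alphabetically)
```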
2.3 Convert Variable to a Factor
Convert the “city” and “toll access” variables to factors so that they
are treated as categorical variables.
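The conversion code is not shown in this report. One way to do it in base R is sketched below. The explicit level order for toll_access is an assumption, and only the two categorical columns are simulated here to keep the example short.

```r
# Sketch: only the two categorical columns are simulated for brevity
set.seed(502)
n <- 20004
num2 <- data.frame(
  city        = sample(c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi"),
                       n, replace = TRUE),
  toll_access = sample(c("Yes", "No"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Convert character columns to factors
num2$city        <- factor(num2$city)                  # levels sorted alphabetically
num2$toll_access <- factor(num2$toll_access,
                           levels = c("No", "Yes"))    # assumed order: "No" as baseline

levels(num2$city)          # the first level is the reference category in lm()
levels(num2$toll_access)
table(num2$city)           # observations per city
```

The first factor level becomes the baseline in the regression, which is why the summary output has coefficients for Bogor, Depok, Jakarta, and Tangerang but none for Bekasi.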
2.4 Summarize The Fitted Regression Model
Summarize the fitted regression model to analyze the coefficients,
standard errors, t-values, and p-values.
##
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age_of_house,
## data = num2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -210204 -33759 285 33642 170659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 297375.454 2335.560 127.325 <2e-16 ***
## house_size 1.618 1.183 1.367 0.171
## num_bedrooms -130.183 315.442 -0.413 0.680
## cityBogor -817.092 1116.151 -0.732 0.464
## cityDepok 758.217 1113.766 0.681 0.496
## cityJakarta 1546.492 1111.149 1.392 0.164
## cityTangerang 1317.354 1117.415 1.179 0.238
## age_of_house -3.022 35.069 -0.086 0.931
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49860 on 19996 degrees of freedom
## Multiple R-squared: 0.0004114, Adjusted R-squared: 6.144e-05
## F-statistic: 1.176 on 7 and 19996 DF, p-value: 0.3129
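Based on the summary output above, the F-test p-value is 0.3129, which is greater than alpha (0.05), so the regression model is not significant. Multiple R-squared is 0.0004114 (about 0.04%), so the model explains almost none of the variance in price and the fit is very poor. This is expected, because price in section 2.1 was simulated independently of the other variables.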
2.5 Check for Multicollinearity
Check for multicollinearity using the Variance Inflation Factor (VIF)
to assess the correlation between independent variables.
## GVIF Df GVIF^(1/(2*Df))
## house_size 1.000217 1 1.000108
## num_bedrooms 1.000295 1 1.000147
## city 1.000784 4 1.000098
## age_of_house 1.000427 1 1.000214
All variables have a GVIF value very close to 1, which means the predictors are almost uncorrelated with one another and multicollinearity is not a problem in this model.
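To show what a VIF measures, the sketch below computes it by hand for house_size with an auxiliary regression: house_size is regressed on the other predictors and VIF = 1 / (1 - R²). For a term with one degree of freedom this is the same quantity as the GVIF. The data are re-simulated with a base R data.frame (an assumption), so the value is not guaranteed to match the table above digit for digit.

```r
# Sketch: manual VIF for one predictor using base R only
set.seed(502)
n <- 20004
num2 <- data.frame(
  house_size   = rnorm(n, mean = 1500, sd = 300),
  num_bedrooms = sample(2:5, n, replace = TRUE),
  city         = sample(c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi"),
                        n, replace = TRUE),
  toll_access  = sample(c("Yes", "No"), n, replace = TRUE),
  age_of_house = rnorm(n, mean = 20, sd = 10),
  price        = rnorm(n, mean = 300000, sd = 50000),
  stringsAsFactors = FALSE
)

# Auxiliary regression: how well do the other predictors explain house_size?
aux_model <- lm(house_size ~ num_bedrooms + city + age_of_house, data = num2)
r2_aux    <- summary(aux_model)$r.squared

# VIF = 1 / (1 - R^2); a value near 1 means almost no shared variance
vif_house_size <- 1 / (1 - r2_aux)
vif_house_size
```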
2.6 Perform Diagnostic Tests
Perform diagnostic tests for heteroskedasticity using the
Breusch-Pagan test and for linearity using the Rainbow test.
##
## studentized Breusch-Pagan test
##
## data: model_num2
## BP = 6.1171, df = 7, p-value = 0.5261
##
## Rainbow test
##
## data: model_num2
## Rain = 0.97778, df1 = 10002, df2 = 9994, p-value = 0.8694
Based on the Breusch-Pagan test, BP = 6.1171 with 7 df and p-value = 0.5261. Because the p-value is greater than alpha (0.05), there is no evidence of significant heteroscedasticity in the model.
Based on the Rainbow test, Rain = 0.97778 with df1 = 10002 and df2 = 9994, and p-value = 0.8694. Because the p-value is greater than alpha (0.05), there is no evidence of a violation of the linearity assumption.
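The mechanics of the studentized Breusch-Pagan statistic can be sketched by hand in base R: regress the squared residuals on the model’s predictors and take n times the R² of that auxiliary regression, which is compared with a chi-squared distribution. This illustrates the idea and is not the code used for the output above. The data are re-simulated with a base R data.frame (an assumption), so the numbers may differ from the reported ones.

```r
# Sketch: manual studentized (Koenker) Breusch-Pagan statistic, base R only
set.seed(502)
n <- 20004
num2 <- data.frame(
  house_size   = rnorm(n, mean = 1500, sd = 300),
  num_bedrooms = sample(2:5, n, replace = TRUE),
  city         = sample(c("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi"),
                        n, replace = TRUE),
  toll_access  = sample(c("Yes", "No"), n, replace = TRUE),
  age_of_house = rnorm(n, mean = 20, sd = 10),
  price        = rnorm(n, mean = 300000, sd = 50000),
  stringsAsFactors = FALSE
)

model_num2 <- lm(price ~ house_size + num_bedrooms + city + age_of_house,
                 data = num2)

# Auxiliary regression of squared residuals on the same predictors
num2$sq_resid <- resid(model_num2)^2
aux_bp <- lm(sq_resid ~ house_size + num_bedrooms + city + age_of_house,
             data = num2)

bp_stat <- nrow(num2) * summary(aux_bp)$r.squared          # test statistic
bp_df   <- length(coef(aux_bp)) - 1                        # regressors excluding intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE) # upper-tail p-value

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A large p-value means the predictors do not explain the squared residuals, which is the basis for concluding that there is no evidence of heteroscedasticity.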
2.7 Create Diagnostic Plots
Create diagnostic plots to assess the model’s assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.
Residuals vs Fitted : The residuals are evenly scattered around the horizontal line with no systematic pattern, so the linearity assumption is met.
Normal Q-Q : The points follow the straight reference line closely, so the residuals are approximately normally distributed.
Scale-Location : The Scale-Location plot shows that the residuals are evenly spread across the fitted values, which indicates constant variance.
Residuals vs Leverage : In the Residuals vs Leverage plot, which uses Cook’s distance, 3 observations are flagged as potential outliers.