UTS Ekonometrika

Analyze and Visualize

Naftali Brigitta Gunawan

March 18, 2024

Kontak	\(\downarrow\)
Email	naftaligunawan@gmail.com
Instagram	https://www.instagram.com/nbrigittag/
RPubs	https://rpubs.com/naftalibrigitta/
Nama	Naftali Brigitta Gunawan
NIM	20214920002

Analyze the relationship between a company’s advertising expenditure, its product price, future value, tax,interest rate, and its sales revenue. Follow the instruction below:

a. Generate hypothetical data for 10002 observations.

set.seed(123)  # For reproducibility
n <- 10002 # use 10002 data, because my last two digits of my student ID is 02. 
expenditure <- runif(n, 1000, 5000)
product_price <- runif(n, 10, 50)
future_value <- runif(n, 10000, 20000)
tax <- runif(n, 0.1, 0.3)
interest_rate <- runif(n, 0.01, 0.05)

b. Create five independent variables: expenditure, its product price, future value, tax, and interest rate.

data <- data.frame(expenditure, product_price, future_value, tax, interest_rate)

data

c. Generate a dependent variable, sales revenue, using a linear relationship with the independent variables.

set.seed(123) # Ensuring reproducibility
beta <- c(2, -0.5, 3, -1000, -500)
sales_revenue <- 50000 + 
                 data$expenditure * beta[1] + 
                 data$product_price * beta[2] + 
                 data$future_value * beta[3] + 
                 data$tax * beta[4] + 
                 data$interest_rate * beta[5] + 
                 rnorm(n)  # Making sure it generates n random numbers
data$sales_revenue <- sales_revenue

d. Fit a multiple regression model where dependent variables are regressed to the independent variables.

model <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = data) 
model

## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = data)
## 
## Coefficients:
##   (Intercept)    expenditure  product_price   future_value            tax  
##    50000.0081         2.0000        -0.4994         3.0000     -1000.3259  
## interest_rate  
##     -499.8633

e. Print a summary of the regression results, which includes coefficients, standard errors, t-statistics, p-values, and R-squared.

summary(model)

## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8319 -0.6707 -0.0067  0.6759  3.8392 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    5.000e+04  7.743e-02 645784.2   <2e-16 ***
## expenditure    2.000e+00  8.711e-06 229607.7   <2e-16 ***
## product_price -4.994e-01  8.663e-04   -576.5   <2e-16 ***
## future_value   3.000e+00  3.447e-06 870404.3   <2e-16 ***
## tax           -1.000e+03  1.736e-01  -5761.8   <2e-16 ***
## interest_rate -4.999e+02  8.649e-01   -578.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9989 on 9996 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.62e+11 on 5 and 9996 DF,  p-value: < 2.2e-16

Notes:
* the all variable have a p-value less than 0.05
* Multiple R-squared value breaks through at 100%, it’s very perfect number for multiple R-squared.
* The regression model is: Y = 5 + 2 X1 - 5.01 X2 + 3 X3 - 9.99 X4 - 4.98 X5

f. Plot the residuals against the fitted values to check for heteroscedasticity (unequal variance) and nonlinearity.

plot(model$fitted.values, model$residuals)
abline(h = 0, col = "red")

g. Plot diagnostic plots to further assess the assumptions of linear regression, including normality of residuals, constant variance, and absence of influential outliers.

par(mfrow=c(2,2))
plot(model)

Notes:
* In Residuals vs Fitted plot, there are 3 numbers are drawn, which means it has an influence. There are 3979, 7271, and 7556.
* In Normal Q-Q plot, the line is linear, which provides strong evidence that these numbers indeed come from a uniform distribution.

Investigate the factors influencing housing prices as the following instructions:

a. Simulate a hypothetical dataset with 20002 observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.

n <- 20002
house_size <- runif(n, 800, 3500)  # square feet
num_bedrooms <- sample(1:5, n, replace = TRUE)
city <- sample(c("CityA", "CityB", "CityC", "CityD", "CityE"), n, replace = TRUE)
toll_access <- sample(c("yes", "no"), n, replace = TRUE)
age <- sample(1:100, n, replace = TRUE)
price <- 100000 + house_size * 200 - num_bedrooms * 10000 + ifelse(toll_access == "yes", 15000, -10000) - age * 500 + rnorm(n, 0, 10000)

data_housing <- data.frame(house_size, num_bedrooms, city, toll_access, age, price)

b. Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.

model_housing <- lm(price ~ house_size + num_bedrooms + city + age, data = data_housing)

c. Convert the “city” and “toll access” variable to a factor to treat it as a categorical variable.

data_housing$city <- as.factor(data_housing$city)
data_housing$toll_access <- as.factor(data_housing$toll_access)

d. Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.

summary(model_housing)

## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age, 
##     data = data_housing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47710 -12636     48  12649  50417 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   1.032e+05  5.112e+02  201.887   <2e-16 ***
## house_size    2.000e+02  1.450e-01 1378.946   <2e-16 ***
## num_bedrooms -1.018e+04  8.032e+01 -126.728   <2e-16 ***
## cityCityB     9.943e-01  3.556e+02    0.003    0.998    
## cityCityC     3.221e+02  3.574e+02    0.901    0.367    
## cityCityD    -1.996e+02  3.569e+02   -0.559    0.576    
## cityCityE    -2.375e+02  3.554e+02   -0.668    0.504    
## age          -5.017e+02  3.915e+00 -128.140   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15990 on 19994 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9898 
## F-statistic: 2.77e+05 on 7 and 19994 DF,  p-value: < 2.2e-16

Notes:
* the all variable have a p-value less than 0.05, only the cityCityB until cityCityE have a p-value more than 0.05.
* Multiple R-squared value is 98.97%, it’s also perfect number for multiple R-squared.
* The regression model is: Y = 1.019e+05 + 2.001e+02 X1 - 9.962e+03 X2 + 2.542e+02 X3 + 2.751e+02 X4 + 4.588e+02 X5 + 6.870e+02 X6 - 4.988e+02 X7

e. Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.

library(car)
vif(model_housing)

##                  GVIF Df GVIF^(1/(2*Df))
## house_size   1.000482  1        1.000241
## num_bedrooms 1.000759  1        1.000379
## city         1.000733  4        1.000092
## age          1.000706  1        1.000353

Notes:
* The all GVIF output are close to 1, which indicates no problematic multicollinearity.

f. Perform diagnostic tests for heteroskedasticity using the Breusch-Pagan test and for linearity using the Rainbow test.

library(lmtest)
library(gvlma)

bptest(model_housing)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_housing
## BP = 2.7678, df = 7, p-value = 0.9056

gvlma(model_housing)

## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + age, 
##     data = data_housing)
## 
## Coefficients:
##  (Intercept)    house_size  num_bedrooms     cityCityB     cityCityC  
##    1.032e+05     2.000e+02    -1.018e+04     9.943e-01     3.221e+02  
##    cityCityD     cityCityE           age  
##   -1.996e+02    -2.375e+02    -5.017e+02  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model_housing) 
## 
##                        Value p-value                   Decision
## Global Stat        4.708e+02  0.0000 Assumptions NOT satisfied!
## Skewness           3.138e-03  0.9553    Assumptions acceptable.
## Kurtosis           4.701e+02  0.0000 Assumptions NOT satisfied!
## Link Function      3.926e-01  0.5310    Assumptions acceptable.
## Heteroscedasticity 3.074e-01  0.5793    Assumptions acceptable.

Notes:
* Global Stat: This is a test statistic that checks the overall goodness of fit of the model. A significant p-value (close to 0) indicates that the model does not fit the data well, as seen here with a p-value of 0.0000000, suggesting that the model assumptions are not satisfied.
* Skewness: This measures the asymmetry of the distribution of residuals. A p-value greater than 0.05, as seen here with a p-value of 0.9699681 generally suggests that the skewness is not significantly different from normal (which would be 0), meaning this model assumption is acceptable.
* Kurtosis: This measures the ‘tailedness’ of the distribution of residuals. A significant p-value (close to 0), as seen here with a p-value of 0.0000000, indicates that the kurtosis of the residuals is significantly different from that of a normal distribution, which is again the case here, indicating that the model assumptions are not satisfied.
* Link Function: This test is relevant for generalized linear models and checks if the chosen link function is appropriate for the data. A non-significant p-value (greater than 0.05), as seen here with a p-value of 0.4032364 suggests that the model assumptions is acceptable.
* Heteroscedasticity: This checks if the variance of the residuals is constant across the range of predicted values. A non-significant p-value (greater than 0.05) indicates that the variance is constant, which is an assumption of regression models. Here, the p-value is somewhat borderline at 0.144716, suggesting that the assumption of constant variance is somewhat acceptable, but may be on the verge of being problematic.

g. Create diagnostic plots to assess the model’s assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.

par(mfrow=c(2,2))
plot(model_housing)

Notes:
* In Residuals vs Fitted plot, there are 3 numbers are drawn, which means it has an influence. There are 4170, 4911, and 17745.
* In Normal Q-Q plot, the line is linear, which provides strong evidence that these numbers indeed come from a uniform distribution.