Ekonometrika

~ Mid Exam ~

Kontak	: \(\downarrow\)
Email	diyasaryanugroho@gmail.com
Instagram	https://www.instagram.com/diasary_nm/
RPubs	https://rpubs.com/diyasarya/

library(broom)
library(DT)
library(ggplot2)
library(lmtest)
library(car)

Soal 1

Analyze the relationship between a company’s advertising expenditure, its product price, future value, tax, interest rate, and its sales revenue. Follow the instruction below:

# a. Generate hypothetical data for 100* observations.
set.seed(863)
obs <- 10007

# b. Create five independent variables: expenditure, its product price, future value, tax, and interest rate.
expenditure <- round(rnorm(obs, mean=250, sd=25), 2)
product_price <- round(rnorm(obs, mean=50, sd=12), 2)
tax <- round(product_price * rnorm(obs, mean=0.15, sd=0.02), 2)
interest_rate <- round(rnorm(obs, mean=0.1, sd=0.02), 2)
future_value <- round(product_price * (1 + interest_rate)^5 + tax, 2)

data <- data.frame(expenditure, product_price, future_value, tax, interest_rate)
data

Keterangan:
- Expenditure yaitu pengeluaran yang diharapkan berkisar +/- 250 (USD)
- Product Price yaitu harga satuan produk yang dijual
- Future Value yaitu harga satuan produk dalam lima tahun kedepan (menggunakan rumus FV + Tax)
- Tax yaitu beban pajak yang ditanggung per satuan produk
- Interest Rate yaitu persentase bunga berkisar antara +/- 10%

# c. Generate a dependent variable, sales revenue, using a linear relationship with the independent variables.
sales_revenue <- round(rnorm(obs, mean=100, sd=18) * future_value, 2)
data <- cbind(data, sales_revenue)
data

Disini untuk menentukan pendapatan saya menggunakan rata-rata produk yang harus terjual per hari nya yaitu +/- 100 pcs lalu dikali dengan harga produk pada lima tahun yang akan mendatang. Sales Revenue menggunakan USD.

# d. Fit a multiple regression model where dependent variables are regressed to the independent variables.
model <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = data)

# e. Print a summary of the regression results, which includes coefficients, standard errors, t-statistics, p-values, and R-squared.
summary(model)

## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7936.6 -1016.2   -19.8  1021.7  6920.9 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -287.6851   383.6111  -0.750    0.453    
## expenditure      0.9524     0.6573   1.449    0.147    
## product_price    8.2143    14.8386   0.554    0.580    
## future_value    96.5330     9.0438  10.674   <2e-16 ***
## tax            -15.1991    18.1810  -0.836    0.403    
## interest_rate  704.3917  3417.4432   0.206    0.837    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1643 on 10001 degrees of freedom
## Multiple R-squared:  0.6474, Adjusted R-squared:  0.6472 
## F-statistic:  3672 on 5 and 10001 DF,  p-value: < 2.2e-16

Dari pemodelan regresi berganda didapat:
1. Persamaan Regresi : Y = -287.69 + 0.95expenditure + 8.21product_price + 96.53future_value - 15.2tax + 705.39interest_rate, Y(Sales_Revenue)
2. F-statistic = 3672 dengan p-value < 2.2e-16 yang artinya H0 ditolak atau terdapat minimal satu variabel bebas yang memiliki pengaruh yang signifikan antara variabel bebas (Expenditure, Product Price, Future Value, Tax, Interest Rate) dengan variabel terikat (Sales Revenue).
3. R-Squared = 0.6474 yang artinya variabel bebas mampu menjelaskan varians variabel bebas sebesar 64,74% sisanya 35,26% dijelaskan oleh faktor lain yang tidak terdapat pada model regresi atau tidak diteliti.
4. T-statistic = 10,67 dengan p-value < 2e-16 yang artinya H0 ditolak atau variabel Future Value berpengaruh terhadap variabel Sales Revenue. Sedangkan variabel lainnya memiliki p-value yang lebih besar dari 0.05.

# f. Plot the residuals against the fitted values to check for heteroscedasticity (unequal variance) and nonlinearity.
resid <- augment(model)

ggplot(model, aes(.fitted, .resid)) +
  geom_point() +
  stat_smooth(method = lm, se=FALSE) +
  geom_segment(aes(xend=.fitted, yend=.resid), color = "red", size=0.3)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## i Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using formula = 'y ~ x'

Berdasarkan plot diatas, terlihat bahwa tidak membentuk pola tertentu yang artinya model regresi memiliki hubungan yang linear. Dan terlihat bahwa variansi kesalahan relatif konstan sepanjang nilai-nilai prediksi yang artinya bahwa residu homogen.

# g. Plot diagnostic plots to further assess the assumptions of linear regression, including normality of residuals, constant variance, and absence of influential outliers.

# Create diagnostic plots
par(mfrow = c(2, 2))  # Set up a 2x2 grid for plots

# 1. Residuals vs Fitted Values Plot
plot(model, which = 1)

# 2. Normal Q-Q Plot
plot(model, which = 2)

# 3. Scale-Location Plot (Squared Root of Standardized Residuals vs Fitted Values)
plot(model, which = 3)

# 4. Residuals vs Leverage Plot
plot(model, which = 4)

# Reset the plotting layout to default
par(mfrow = c(1, 1))

Dari beberapa plot diatas disimpulkan bahwa:
1. Plot Residuals vs Fitted menunjukkan tidak terdapat pola yang artinya model tersebut memiliki hubungan linear.
2. Plot Normal Q-Q menunjukkan titik-titik residu mengikuti garis lurus yang artinya model tersebut berdistribusi normal.
3. Plot Scale-Location menunjukkan variansi residu merata di sepanjang garis prediksi yang artinya model tersebut homogen.
4. Plot Cook’s distance menunjukkan terdapat titik ekstrim pada (#649, #4501, dan #8483) namun, hal ini tidak berarti terdapat outlier karenaa tidak lebih dari 3 titik menurut (James et al, 2014).

Soal 2

Investigate the factors influencing housing prices as the following instructions:

# a. Simulate a hypothetical dataset with 200* observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.
set.seed(863)
obss <- 20007

house_size <- round(rnorm(obss, mean=200, sd=50), 2)
num_bedrooms <- ifelse(house_size >=200, sample(7:9, obss, replace = TRUE), 
                       ifelse(house_size >= 100, sample(4:6, obss, replace = TRUE), sample(1:3, obss, replace = TRUE)))
city <- sample(c(1:5), obss, replace = TRUE)
toll_access <- sample(c(0:1), obss, replace = TRUE)
house_age <- sample(1:15, obss, replace = TRUE)
price <- round(rnorm(obss, mean=100, sd=50) + rnorm(obss, mean=15, sd=3)*house_size + rnorm(obss, mean=7, sd=2)*num_bedrooms - rnorm(obss, mean=5, sd=1)*house_age + ifelse(toll_access==1, 2.5, 0) + ifelse(city==1, 5, ifelse(city==2, 4.5, ifelse(city==3, 2.5, ifelse(city==4, 3, 2.7)))), 2)

dataa <- data.frame(house_size, num_bedrooms, city, toll_access, house_age, price)
dataa

Keterangan:
- House Size yaitu luas bangunan dalam satuan meter persegi. Harga tanah permeter diharapkan berkisar +/- 15jt
- Number of Bedrooms yaitu jumlah kamar tidur. Perkamar tidur, harga rumah akan meningkat sebesar +/- 7jt
- City yaitu lima kota seperti (1=Jakarta, 2=Tangerang, 3=Bandung, 4=Depok, 5=Bekasi)
- Toll Access yaitu lokasi rumah dekat dengan jalan tol (1=No dan 2=Yes)
- House Age yaitu usia rumah dalam satuan tahun. Lamanya rumah berdiri akan mengurangi harga rumah sebesar +/- 5jt pertahunnya.
- Price yaitu harga rumah (dalam Jt Rupiah) dengan kriteria sbg berikut:
a. Jika City = 1 atau Jakarta harga rumah ditambahkan 5jt.
b. Jika City = 2 atau Tangerang harga rumah ditambahkan 4,5jt.
c. Jika City = 3 atau Bandung harga rumah ditambahkan 2,5jt.
d. Jika City = 4 atau Depok harga rumah ditambahkan 3jt.
e. Jika City = 5 atau Bekasi harga rumah ditambahkan 2,7jt.

# b. Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.
modell <- lm(price ~ house_size + num_bedrooms + city + house_age, data = dataa)
modell

## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + house_age, 
##     data = dataa)
## 
## Coefficients:
##  (Intercept)    house_size  num_bedrooms          city     house_age  
##     118.9762       14.7830       10.2580       -0.4247       -4.6062

Dengan menggunakan fungsi lm(), didapat persamaan regresinya adalah Y = 118.98 + 14.78house_size + 10.26num_bedrooms - 0.42city - 4.61house_age, dengan variabel terikatnya adalah Price.

# c. Convert the "city" and “toll access” variable to a factor to treat it as a categorical variable.
dataaa <- dataa
dataaa$city <- as.factor(dataaa$city)
dataaa$toll_access <- as.factor(dataaa$toll_access)
dataaa

# d. Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.
modell <- lm(price ~ house_size + num_bedrooms + city + toll_access + house_age, data = dataaa)
summary(modell)

## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + toll_access + 
##     house_age, data = dataaa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2848.86  -385.26     0.06   383.60  2499.07 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  112.2316    22.1295   5.072 3.98e-07 ***
## house_size    14.7848     0.1318 112.181  < 2e-16 ***
## num_bedrooms  10.2182     3.5640   2.867  0.00415 ** 
## city2          8.7246    13.8276   0.631  0.52808    
## city3         -3.0946    13.9027  -0.223  0.82386    
## city4        -19.0832    13.7662  -1.386  0.16569    
## city5         12.3213    13.9091   0.886  0.37571    
## toll_access1  11.2201     8.7047   1.289  0.19742    
## house_age     -4.6023     1.0041  -4.584 4.60e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 615.5 on 19998 degrees of freedom
## Multiple R-squared:  0.5981, Adjusted R-squared:  0.5979 
## F-statistic:  3720 on 8 and 19998 DF,  p-value: < 2.2e-16

Dari pemodelan regresi berganda didapat:
1. Persamaan Regresi : Y = 112.23 + 14.78house_size + 10.22num_bedrooms + 8.72city2 - 3.09city3 - 19.08city4 + 12.32city5 + 11.22toll_access1 - 4.60house_age, Y(Price). Pada variabel ini untuk koefisien dari variabel city terbagi menjadi 4 dan toll_access 1, sehingga tidak dapat mewakili untuk keseluruhan variabel city dan toll_access, hal ini terjadi karena variabel city dan toll_access merupakan jenis data kategorik bukan numerik.
2. F-statistic = 3720 dengan p-value < 2.2e-16 yang artinya H0 ditolak atau terdapat minimal satu variabel bebas yang memiliki pengaruh yang signifikan antara variabel bebas (House Size, Numbers of Bedrooms, City, Toll Access, dan House Age) dengan variabel terikat (Price).
3. R-Squared = 0.5981 yang artinya variabel bebas mampu menjelaskan varians variabel bebas sebesar 59,81% sisanya 40,19% dijelaskan oleh faktor lain yang tidak terdapat pada model regresi atau tidak diteliti.
4. T-statistic = 112.18 dengan p-value < 2e-16 yang artinya H0 ditolak atau variabel House Size berpengaruh signifikan terhadap variabel Price begitupun variabel Numbers of Bedrooms dah House Age berpengaruh terhadap variabel Price karena p-value < 0,05. Sedangkan variabel City dan Toll Accessmemiliki p-value yang lebih besar dari 0.05 yang berarti variabel tersebut tidak berpengaruh signifikan terhadap variabel Price.
Untuk memperbaiki persamaan regresi diatas saya lakukan pemodelan yang lain dengan menggunakan variabel city dan toll_access sebagai data numerik.

modelll <- lm(price ~ house_size + num_bedrooms + city + toll_access + house_age, data = dataa)
summary(modelll)

## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + toll_access + 
##     house_age, data = dataa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2840.14  -383.40     0.02   386.11  2498.64 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  113.3906    22.3450   5.075 3.92e-07 ***
## house_size    14.7831     0.1318 112.161  < 2e-16 ***
## num_bedrooms  10.2222     3.5642   2.868  0.00413 ** 
## city          -0.4193     3.0960  -0.135  0.89228    
## toll_access   11.2365     8.7054   1.291  0.19680    
## house_age     -4.5933     1.0041  -4.574 4.80e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 615.6 on 20001 degrees of freedom
## Multiple R-squared:  0.598,  Adjusted R-squared:  0.5979 
## F-statistic:  5950 on 5 and 20001 DF,  p-value: < 2.2e-16

Dari pemodelan regresi berganda didapat:
1. Persamaan Regresi : Y = 112.23 + 14.78house_size + 10.22num_bedrooms - 0.42city + 11.24toll_access - 4.60house_age, Y(Price).
2. F-statistic = 5950 dengan p-value < 2.2e-16 yang artinya H0 ditolak atau terdapat minimal satu variabel bebas yang memiliki pengaruh yang signifikan antara variabel bebas (House Size, Numbers of Bedrooms, City, Toll Access, dan House Age) dengan variabel terikat (Price).
3. R-Squared = 0.5981 yang artinya variabel bebas mampu menjelaskan varians variabel bebas sebesar 59,81% sisanya 40,19% dijelaskan oleh faktor lain yang tidak terdapat pada model regresi atau tidak diteliti.
4. T-statistic = 112.18 dengan p-value < 2e-16 yang artinya H0 ditolak atau variabel House Size berpengaruh signifikan terhadap variabel Price begitupun variabel Numbers of Bedrooms dah House Age berpengaruh terhadap variabel Price karena p-value < 0,05. Sedangkan variabel City dan Toll Accessmemiliki p-value yang lebih besar dari 0.05 yang berarti variabel tersebut tidak berpengaruh signifikan terhadap variabel Price.
Secara output tidak jauh berbeda jika menggunakan data kategorik, akan tetapi dengan menggunakan data numerik persamaan regresi yang dihasilkan lebih mewakili untuk keseluruhan variabel city dan toll_access

# e. Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.
vif_values <- vif(modell)
vif_values

##                  GVIF Df GVIF^(1/(2*Df))
## house_size   2.275940  1        1.508622
## num_bedrooms 2.275762  1        1.508563
## city         1.000453  4        1.000057
## toll_access  1.000244  1        1.000122
## house_age    1.000574  1        1.000287

Berdasarkan nilai dari VIF >10 untuk masing-masing variabel, maka tidak terdapat multikolinearitas. Umumnya, nilai VIF lebih dari 5 atau 10 berarti terdapat multikolinearitas.

# f. Perform diagnostic tests for heteroskedasticity using the Breusch-Pagan test and for linearity using the Rainbow test.

# 1. Perform Breusch-Pagan test for heteroskedasticity
bp_test <- bptest(modell)
bp_test

## 
##  studentized Breusch-Pagan test
## 
## data:  modell
## BP = 1573.4, df = 8, p-value < 2.2e-16

modellll <- lm(price ~ log(house_size) + log(num_bedrooms) + city + toll_access + log(house_age), data = dataa)               #transformasi log

## Warning in log(house_size): NaNs produced

modelllll <- lm(price ~ house_size^2 + num_bedrooms^2 + city + toll_access + house_age^2, data = dataa)                      #transformasi kuadrat
bp_testt <- bptest(modellll)
bp_testtt <- bptest(modelllll)
bp_testt

## 
##  studentized Breusch-Pagan test
## 
## data:  modellll
## BP = 207.34, df = 5, p-value < 2.2e-16

bp_testtt

## 
##  studentized Breusch-Pagan test
## 
## data:  modelllll
## BP = 1574.1, df = 5, p-value < 2.2e-16

# 2. Perform Rainbow test for linearity
rainbow_test <- raintest(modell, fraction = 0.7, order.by = NULL, center = NULL,
   data=list())
rainbow_test

## 
##  Rainbow test
## 
## data:  modell
## Rain = 0.9894, df1 = 6003, df2 = 13995, p-value = 0.6862

Berdasarkan uji asumsi heteroskedastisitas dengan Breusch-Pagan Test didapat
H0 : Tidak terdapat heteroskedastisitas atau variansi residu konstan pada garis prediksi
H1 : Terdapat heteroskedastisitas atau variansi residu tidak konstan pada garis prediksi
BP-Statistic = 1573.4 dengan p-value < 2.2e-16 yang artinya H0 ditolak atau terdapat heteroskedastisitas atau variansi residu tidak konstan pada garis prediksi. Jika residu tidak konstan maka model regresi perlu di transformasi namun, setelah ditransformasi dengan log dan kuadratik tetap p-value < 2.2e-16.

Berdasarkan uji asumsi linearitas dengan Rainbow Test didapat
H0 : Variabel bebas memiliki hubungan yang linear dengan variabel terikat.
H1 : Variabel bebas memiliki hubungan yang tidak linear dengan variabel terikat.
Rain-Statistic = 0.9894 dengan p-value = 0.6862 lebih besar dari alpha (0,05) yang artinya H0 diterima atau variabel bebas memiliki hubungan yang linear dengan variabel terikat.

# g. Create diagnostic plots to assess the model's assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.
# Create diagnostic plots
par(mfrow = c(2, 2))  # Set up a 2x2 grid for plots

# 1. Residuals vs Fitted Values Plot
plot(modell, which = 1)

# 2. Normal Q-Q Plot
plot(modell, which = 2)

# 3. Scale-Location Plot (Squared Root of Standardized Residuals vs Fitted Values)
plot(modell, which = 3)

# 4. Residuals vs Leverage Plot
plot(modell, which = 4)

# Reset the plotting layout to default
par(mfrow = c(1, 1))

Dari beberapa plot diatas disimpulkan bahwa:
1. Plot Residuals vs Fitted menunjukkan tidak terdapat pola yang artinya model tersebut memiliki hubungan linear.
2. Plot Normal Q-Q menunjukkan titik-titik residu mengikuti garis lurus yang artinya model tersebut berdistribusi normal.
3. Plot Scale-Location menunjukkan variansi residu merata di sepanjang garis prediksi yang artinya model tersebut homogen.
4. Plot Cook’s distance menunjukkan terdapat titik ekstrim pada (#13558, #15500, dan #18607) namun, hal ini tidak berarti terdapat outlier karena tidak lebih dari 3 titik menurut (James et al, 2014).