| Contact | \(\downarrow\) |
| calvin.riswandi@gmail.com | |
| https://www.instagram.com/cvnopp_/ | |
| RPubs | https://rpubs.com/calvinriswandy/ |
| Name | Calvin Riswandi |
| NIM | 20214920003 |
sales_revenue <- round(rnorm(obs, mean=100, sd=15) * future_value, 2)
data <- cbind(data, sales_revenue)
datatable(data)
# d. Fit the multiple regression model
model <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = data)
# e. Summary
summary(model)
##
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value +
## tax + interest_rate, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6175.5  -839.0    11.6   855.9  6272.0
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -104.5514   315.7783  -0.331    0.741
## expenditure      0.1211     0.5454   0.222    0.824
## product_price    1.7735    12.2084   0.145    0.885
## future_value    97.6927     7.4052  13.192   <2e-16 ***
## tax             13.0392    15.2288   0.856    0.392
## interest_rate 1075.8715  2790.1485   0.386    0.700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1362 on 9997 degrees of freedom
## Multiple R-squared: 0.7349, Adjusted R-squared: 0.7348
## F-statistic: 5543 on 5 and 9997 DF, p-value: < 2.2e-16
From the summary above, the model explains about 73.48% of the variation in the dependent variable (adjusted R-squared = 0.7348). The fitted regression equation is: Y = -104.55 + 0.12 expenditure + 1.77 product_price + 97.69 future_value + 13.04 tax + 1075.87 interest_rate. The F-statistic is 5543 with p-value < 2.2e-16, so H0 (all slope coefficients equal zero) is rejected: at least one independent variable has a significant effect on sales_revenue. Judging by the individual t-tests, only future_value is significant at the 5% level.
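The quantities quoted in this interpretation can also be pulled out of the `summary()` object rather than read off the printout. The sketch below uses a small hypothetical dataset (`toy`, with made-up variables `x1`, `x2`, `y`), not the report's data, and only base R.

```r
# Hypothetical toy data; not the report's dataset.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 + 3 * toy$x1 + rnorm(n)   # x2 has no true effect on y

fit <- lm(y ~ x1 + x2, data = toy)
s <- summary(fit)

adj_r2 <- s$adj.r.squared            # adjusted R-squared
coefs  <- coef(s)                    # matrix: Estimate, Std. Error, t value, Pr(>|t|)
f      <- s$fstatistic               # named vector: value, numdf, dendf

# p-value of the overall F-test (H0: all slope coefficients are zero)
f_pvalue <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Terms whose individual t-test is significant at the 5% level
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```

Applied to the report's `model`, the same accessors would give the adjusted R-squared and F-test values printed above.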
# f. Plot the residuals against the fitted values to check for heteroskedasticity (unequal variance) and nonlinearity.
library(broom)
library(DT)
library(ggplot2)
resid <- augment(model)  # adds .fitted and .resid columns to the model data
ggplot(resid, aes(.fitted, .resid)) +
  geom_point() +
  stat_smooth(method = lm, se=FALSE) +
  # xend/yend repeat each point's own coordinates, so these segments have zero length
  geom_segment(aes(xend=.fitted, yend=.resid), color = "yellow", linewidth=0.3)
## `geom_smooth()` using formula = 'y ~ x'
Based on the plot above, the residuals do not form any particular pattern,
which suggests that the relationship in the regression model is linear.
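A visual check like this can be backed by a rough numerical one. The sketch below is an informal illustration on hypothetical toy data, not a formal test and not part of the original analysis: it regresses the residuals on the fitted values and their square, so a clearly non-zero squared term hints at curvature. The helper name `curvature_p` is made up for this example.

```r
# Hypothetical toy data; not the report's dataset.
set.seed(2)
x <- runif(300, 0, 10)
y_lin  <- 1 + 2 * x + rnorm(300)              # truly linear relationship
y_quad <- 1 + 2 * x + 0.5 * x^2 + rnorm(300)  # curved relationship

# Informal check: p-value of the squared fitted-value term
# in an auxiliary regression of the residuals.
curvature_p <- function(y, x) {
  fit <- lm(y ~ x)
  d <- data.frame(r = resid(fit), f = fitted(fit))
  aux <- lm(r ~ f + I(f^2), data = d)
  coef(summary(aux))["I(f^2)", "Pr(>|t|)"]
}

p_lin  <- curvature_p(y_lin, x)   # p-value for the linear data
p_quad <- curvature_p(y_quad, x)  # very small p-value: curvature left in the residuals
```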
# g. Diagnostic plots to further assess the linear regression assumptions, including normality of residuals, constant variance, and the absence of influential outliers.
# Create diagnostic plots
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggfortify)
par(mfrow = c(2, 2))
autoplot(model)
1. The Residuals vs Fitted plot shows no pattern, which suggests a linear relationship.
2. The Normal Q-Q plot shows the residual points following the straight line, which suggests the residuals are approximately normally distributed.
3. The Scale-Location plot shows the residual variance spread evenly along the fitted values, which suggests constant variance (homoskedasticity).
4. The Cook's distance plot flags three potential outliers: observations #7143, #7446, and #7670.
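The observations singled out in item 4 can also be found numerically. The sketch below is a minimal illustration with base R's `cooks.distance()` on hypothetical toy data containing one deliberately planted influential point. The `4 / n` cutoff is a common rule of thumb, assumed here for illustration, and is not taken from the report.

```r
# Hypothetical toy data; not the report's dataset.
set.seed(3)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

# Plant one influential point: extreme x with a y far off the true line
toy$x[n] <- 6
toy$y[n] <- -10

fit <- lm(y ~ x, data = toy)
cd <- cooks.distance(fit)

top3    <- order(cd, decreasing = TRUE)[1:3]  # row indices of the three largest values
flagged <- which(cd > 4 / n)                  # rule-of-thumb cutoff, not a formal test
```

Running the same two lines on the report's `model` would list the rows with the largest Cook's distance for comparison with the labels in the plot.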
# a. Simulate a hypothetical dataset with 20003 observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.
set.seed(123)
obss <- 20003
house_size <- round(rnorm(obss, mean=1000, sd=500), 2)
num_bedrooms <- ifelse(house_size >=200, sample(7:9, obss, replace = TRUE),
ifelse(house_size >= 100, sample(4:6, obss, replace = TRUE), sample(1:3, obss, replace = TRUE)))
city <- sample(c("Jakarta", "Tangerang Selatan", "Bandung", "Sukabumi", "Bekasi"), obss, replace = TRUE)
toll_access <- sample(c(0:1), obss, replace = TRUE)
house_age <- sample(1:15, obss, replace = TRUE)
price <- round(rnorm(obss, mean=100, sd=50) +
rnorm(obss, mean=15, sd=5) * house_size +
rnorm(obss, mean=7, sd=2) * num_bedrooms -
rnorm(obss, mean=5, sd=1) * house_age +
ifelse(toll_access == 1, 2.5, 0) +
ifelse(city == "Jakarta", 4.5,
ifelse(city == "Tangerang Selatan", 4.5,
ifelse(city == "Bandung", 2.5,
ifelse(city == "Sukabumi", 3, 2.7)))),2)
# Create data frame
dataa <- data.frame(house_size, num_bedrooms, city, toll_access, house_age, price)
datatable(dataa)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
# b. Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.
modell <- lm(price ~ house_size + num_bedrooms + city + house_age, data = dataa)
modell
##
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + house_age,
## data = dataa)
##
## Coefficients:
##           (Intercept)             house_size           num_bedrooms
##               314.261                 14.835                  5.305
##            cityBekasi            cityJakarta           citySukabumi
##                90.549                 20.007                -18.625
## cityTangerang Selatan              house_age
##               106.781                -15.981
# c. Convert the "city" and “toll access” variable to a factor to treat it as a categorical variable.
dataa$city <- factor(dataa$city)
dataa$toll_access <- factor(dataa$toll_access)
datatable(dataa)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
# d. Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.
modell <- lm(price ~ house_size + num_bedrooms + city + toll_access + house_age, data = dataa)
summary(modell)
##
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + toll_access +
## house_age, data = dataa)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -42400  -2801    -67   2846  34054
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            398.46593  247.35656   1.611   0.1072
## house_size              14.83476    0.08732 169.888   <2e-16 ***
## num_bedrooms             4.24033   30.16567   0.141   0.8882
## cityBekasi              86.92570  126.43301   0.688   0.4918
## cityJakarta             16.20593  126.76741   0.128   0.8983
## citySukabumi           -22.33741  126.35148  -0.177   0.8597
## cityTangerang Selatan  104.07858  126.33675   0.824   0.4101
## toll_access1          -145.21510   80.00680  -1.815   0.0695 .
## house_age              -16.05783    9.21158  -1.743   0.0813 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5655 on 19994 degrees of freedom
## Multiple R-squared: 0.6328, Adjusted R-squared: 0.6327
## F-statistic: 4307 on 8 and 19994 DF, p-value: < 2.2e-16
From the summary above, the model explains about 63.27% of the variation in price (adjusted R-squared = 0.6327). The fitted regression equation is: Y = 398.47 + 14.83 house_size + 4.24 num_bedrooms + 86.93 cityBekasi + 16.21 cityJakarta - 22.34 citySukabumi + 104.08 cityTangerang Selatan - 145.22 toll_access1 - 16.06 house_age, where the city terms are dummy variables measured against the baseline city, Bandung. The F-statistic is 4307 with p-value < 2.2e-16, so H0 (all slope coefficients equal zero) is rejected: at least one independent variable has a significant effect on price. At the 5% level, only house_size is individually significant.
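The city coefficients in this equation are dummy variables, so each one is read relative to a baseline level that does not appear in the output. The sketch below uses a tiny hypothetical vector rather than the report's `dataa` to show how R picks that baseline and how it could be changed with `relevel()`. Jakarta as the new reference is only an example choice.

```r
# Tiny hypothetical vector using the report's five city names.
city <- factor(c("Jakarta", "Bandung", "Bekasi", "Sukabumi",
                 "Tangerang Selatan", "Bandung"))

# Factor levels are sorted, so the first level becomes the baseline
baseline <- levels(city)[1]

# Dummy columns that lm() would create: one per non-baseline level
dummy_cols <- colnames(model.matrix(~ city))

# Choosing a different reference level (illustrative choice)
city2 <- relevel(city, ref = "Jakarta")
dummy_cols2 <- colnames(model.matrix(~ city2))
```

With Bandung as the baseline, a coefficient such as cityBekasi is the estimated price difference between Bekasi and Bandung, holding the other variables fixed.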
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# e. Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.
vif_values <- vif(modell)
vif_values
##                  GVIF Df GVIF^(1/(2*Df))
## house_size   1.192478  1        1.092006
## num_bedrooms 1.192990  1        1.092241
## city         1.000851  4        1.000106
## toll_access  1.000852  1        1.000426
## house_age    1.000245  1        1.000122
The results above show no indication of meaningful multicollinearity in the regression model: every GVIF value is close to 1, so the independent variables are only weakly correlated with one another.
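For numeric predictors, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other predictors. The sketch below computes this by hand in base R on hypothetical toy predictors (`x1`, `x2`, `x3` are made up). It does not reproduce the GVIF that `car::vif()` reports for multi-level factors such as `city`.

```r
# Hypothetical toy predictors; not the report's dataset.
set.seed(4)
n <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)  # moderately correlated with x1
x3 <- rnorm(n)                       # independent of the others
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
```

Values near 1, as in the table above, mean a predictor is barely explained by the others. The correlated pair `x1` and `x2` in this toy example gives visibly larger values than `x3`.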
# f. Perform diagnostic tests for heteroskedasticity using the Breusch-Pagan test and for linearity using the Rainbow test.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# 1. Perform Breusch-Pagan test for heteroskedasticity
bp_test <- bptest(modell)
bp_test
##
## studentized Breusch-Pagan test
##
## data: modell
## BP = 3241, df = 8, p-value < 2.2e-16
# 2. Perform Rainbow test for linearity
rainbow_test <- raintest(modell, fraction = 0.7, order.by = NULL, center = NULL,
data=list())
rainbow_test
##
## Rainbow test
##
## data: modell
## Rain = 1.031, df1 = 6001, df2 = 13993, p-value = 0.08014
In the Breusch-Pagan test the p-value is < 2.2e-16, so H0 (constant error variance) is rejected, which indicates heteroskedasticity in the regression model.
For the Rainbow test the p-value is 0.08014 > 0.05, so H0 is not rejected: there is no significant evidence against a linear relationship between the independent and dependent variables.
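To see what the Breusch-Pagan statistic measures, the studentized version can be sketched by hand. Regress the squared residuals on the model's regressors and multiply the R² of that auxiliary regression by the sample size. The result is compared with a chi-squared distribution whose degrees of freedom equal the number of regressors. The data below are hypothetical and built to be heteroskedastic. This is an illustration in base R, not the report's calculation.

```r
# Hypothetical toy data with error spread that grows with x.
set.seed(5)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the regressors
aux <- lm(resid(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared    # studentized (Koenker) BP statistic
bp_df   <- length(coef(aux)) - 1         # number of regressors, intercept excluded
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
```

On the same `fit`, `lmtest::bptest(fit)` should report the same statistic, since its default is the studentized form.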
# g. Create diagnostic plots to assess the model's assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.
# Create diagnostic plots
library(dplyr)
library(ggfortify)
par(mfrow = c(2, 2))
autoplot(modell)