Contact \(\downarrow\)
Email
Instagram https://www.instagram.com/cvnopp_/
RPubs https://rpubs.com/calvinriswandy/
Name Calvin Riswandi
Student ID (NIM) 20214920003

1 First Problem

1.1 Create five independent variables: expenditure, product price, future value, tax, and interest rate.

1.2 Create the dependent variable using a linear relationship with the independent variables.

sales_revenue <- round(rnorm(obs, mean=100, sd=15) * future_value, 2)
data <- cbind(data, sales_revenue)
datatable(data)

1.3 Fit the multiple regression model and summarize it

# d. Fit the multiple regression model
model <- lm(sales_revenue ~ expenditure + product_price + future_value + tax + interest_rate, data = data)

# e. Summary
summary(model)
## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6175.5  -839.0    11.6   855.9  6272.0 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -104.5514   315.7783  -0.331    0.741    
## expenditure      0.1211     0.5454   0.222    0.824    
## product_price    1.7735    12.2084   0.145    0.885    
## future_value    97.6927     7.4052  13.192   <2e-16 ***
## tax             13.0392    15.2288   0.856    0.392    
## interest_rate 1075.8715  2790.1485   0.386    0.700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1362 on 9997 degrees of freedom
## Multiple R-squared:  0.7349, Adjusted R-squared:  0.7348 
## F-statistic:  5543 on 5 and 9997 DF,  p-value: < 2.2e-16

From the summary above, the model explains about 73.48% (adjusted R-squared) of the variation in the dependent variable. The fitted regression equation is: Y = -104.55 + 0.12 expenditure + 1.77 product_price + 97.69 future_value + 13.04 tax + 1075.87 interest_rate. The F-statistic is 5543 with p-value < 2.2e-16, so H0 (all slope coefficients equal zero) is rejected: at least one independent variable has a significant effect on sales_revenue. In the coefficient table, only future_value is individually significant (p < 2e-16); the other four predictors have p-values well above 0.05.
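The figures quoted in this interpretation can also be extracted from the fitted object rather than read off the printout. The sketch below is a minimal illustration on a small simulated data set; the names `toy`, `x1`, `x2`, and `fit` are hypothetical and are not part of the assignment data.

```r
# Hypothetical toy data: y depends on x1 only, x2 is pure noise.
set.seed(1)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)
s   <- summary(fit)

# Adjusted R-squared: share of variance explained, penalised for model size.
adj_r2 <- s$adj.r.squared

# Overall F-test: H0 says every slope coefficient is zero.
f      <- s$fstatistic   # named vector: value, numdf, dendf
f_pval <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Per-coefficient p-values show which predictors drive the rejection.
coef_pvals <- s$coefficients[, "Pr(>|t|)"]

print(round(adj_r2, 4))
print(f_pval < 0.05)
print(round(coef_pvals, 4))
```

Applied to the model above, the same accessors would be called on `summary(model)`.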

# f. Plot residuals against fitted values to check for heteroskedasticity (unequal variance) and nonlinearity.
library(broom)
library(DT)
library(ggplot2)
resid <- augment(model)  # model data plus .fitted and .resid columns

ggplot(resid, aes(.fitted, .resid)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  # linewidth replaces the size argument deprecated for lines in ggplot2 3.4.0
  geom_segment(aes(xend = .fitted, yend = .resid), color = "yellow", linewidth = 0.3)
## `geom_smooth()` using formula = 'y ~ x'

Based on the plot above, the residuals do not form any particular pattern, which suggests that the relationship in the regression model is linear.

# g. Diagnostic plots to further assess the linear regression assumptions, including normality of residuals, constant variance, and the absence of influential outliers.

# Create diagnostic plots
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggfortify)
par(mfrow = c(2, 2))
autoplot(model)

  1. The Residuals vs Fitted plot shows no pattern, which suggests the relationship in the model is linear.
  2. The Normal Q-Q plot shows the residual points following the straight line, which suggests the residuals are approximately normally distributed.
  3. The Scale-Location plot appears to show residual variance spread evenly along the fitted values, which suggests homoskedasticity.
  4. The Cook's distance plot highlights three outliers: observations #7143, #7446, and #7670.
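The observations flagged in the last plot can also be listed numerically. The following is a minimal sketch on hypothetical toy data with one planted influential point; `toy`, `fit`, `cd`, and `top3` are illustrative names only. It uses `cooks.distance()` from base R's stats package.

```r
# Hypothetical toy data with one planted influential point.
set.seed(2)
toy <- data.frame(x = rnorm(100))
toy$y <- 1 + 2 * toy$x + rnorm(100)
toy$x[100] <- 8     # far from the other x values (high leverage)
toy$y[100] <- -20   # and far from the line the other points follow

fit <- lm(y ~ x, data = toy)

# Cook's distance per observation; larger values mean more influence on the fit.
cd <- cooks.distance(fit)

# The three most influential observations, named by row number.
top3 <- head(sort(cd, decreasing = TRUE), 3)
print(top3)
```

Running the same two lines on `model` would list its most influential rows by Cook's distance, which can be compared with the labels shown in the plot.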

2 Second Problem

2.1 Investigate the factors influencing housing prices as the following instructions:

# a. Simulate a hypothetical dataset with 20003 observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.
set.seed(123)
obss <- 20003

house_size <- round(rnorm(obss, mean=1000, sd=500), 2)
num_bedrooms <- ifelse(house_size >=200, sample(7:9, obss, replace = TRUE), 
                       ifelse(house_size >= 100, sample(4:6, obss, replace = TRUE), sample(1:3, obss, replace = TRUE)))
city <- sample(c("Jakarta", "Tangerang Selatan", "Bandung", "Sukabumi", "Bekasi"), obss, replace = TRUE)
toll_access <- sample(c(0:1), obss, replace = TRUE)
house_age <- sample(1:15, obss, replace = TRUE)
price <- round(rnorm(obss, mean=100, sd=50) + 
               rnorm(obss, mean=15, sd=5) * house_size + 
               rnorm(obss, mean=7, sd=2) * num_bedrooms - 
               rnorm(obss, mean=5, sd=1) * house_age + 
               ifelse(toll_access == 1, 2.5, 0) + 
               ifelse(city == "Jakarta", 4.5, 
                      ifelse(city == "Tangerang Selatan", 4.5, 
                             ifelse(city == "Bandung", 2.5, 
                                    ifelse(city == "Sukabumi", 3, 2.7)))),2)


# Create data frame
dataa <- data.frame(house_size, num_bedrooms, city, toll_access, house_age, price)
datatable(dataa)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

2.2 Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.

# b. Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.
modell <- lm(price ~ house_size + num_bedrooms + city + house_age, data = dataa)
modell
## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + house_age, 
##     data = dataa)
## 
## Coefficients:
##           (Intercept)             house_size           num_bedrooms  
##               314.261                 14.835                  5.305  
##            cityBekasi            cityJakarta           citySukabumi  
##                90.549                 20.007                -18.625  
## cityTangerang Selatan              house_age  
##               106.781                -15.981

2.3 Convert the “city” and “toll access” variable to a factor to treat it as a categorical variable.

# c. Convert the "city" and “toll access” variable to a factor to treat it as a categorical variable.
dataa$city <- factor(dataa$city)
dataa$toll_access <- factor(dataa$toll_access)
datatable(dataa)
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
# d. Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.
modell <- lm(price ~ house_size + num_bedrooms + city + toll_access + house_age, data = dataa)
summary(modell)
## 
## Call:
## lm(formula = price ~ house_size + num_bedrooms + city + toll_access + 
##     house_age, data = dataa)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42400  -2801    -67   2846  34054 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            398.46593  247.35656   1.611   0.1072    
## house_size              14.83476    0.08732 169.888   <2e-16 ***
## num_bedrooms             4.24033   30.16567   0.141   0.8882    
## cityBekasi              86.92570  126.43301   0.688   0.4918    
## cityJakarta             16.20593  126.76741   0.128   0.8983    
## citySukabumi           -22.33741  126.35148  -0.177   0.8597    
## cityTangerang Selatan  104.07858  126.33675   0.824   0.4101    
## toll_access1          -145.21510   80.00680  -1.815   0.0695 .  
## house_age              -16.05783    9.21158  -1.743   0.0813 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5655 on 19994 degrees of freedom
## Multiple R-squared:  0.6328, Adjusted R-squared:  0.6327 
## F-statistic:  4307 on 8 and 19994 DF,  p-value: < 2.2e-16

From the summary above, the model explains about 63.27% (adjusted R-squared) of the variation in the dependent variable. The fitted regression equation is: Y = 398.47 + 14.83 house_size + 4.24 num_bedrooms + 86.93 cityBekasi + 16.21 cityJakarta - 22.34 citySukabumi + 104.08 cityTangerang Selatan - 145.22 toll_access1 - 16.06 house_age. The F-statistic is 4307 with p-value < 2.2e-16, so H0 (all slope coefficients equal zero) is rejected: at least one independent variable has a significant effect on price. In the coefficient table, only house_size is significant at the 5% level; toll_access1 and house_age are significant only at the 10% level, and the remaining coefficients are not significant.
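The equation has no Bandung term because R treats the first factor level as the reference category, so each city coefficient is a difference relative to Bandung. The short sketch below illustrates this with the same five city labels; the object names `X` and `city_jkt` are hypothetical.

```r
# Same five city labels as in the simulation above.
city <- factor(c("Jakarta", "Tangerang Selatan", "Bandung", "Sukabumi", "Bekasi"))

# Levels are sorted alphabetically; the first one is the reference level.
print(levels(city))

# The design matrix has an intercept plus one 0/1 column per non-reference city.
X <- model.matrix(~ city)
print(colnames(X))

# To compare against another city instead, change the reference level.
city_jkt <- relevel(city, ref = "Jakarta")
print(levels(city_jkt)[1])
```

The same logic applies to toll_access: the coefficient toll_access1 is the difference relative to houses with toll_access equal to 0.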

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# e. Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.
vif_values <- vif(modell)
vif_values
##                  GVIF Df GVIF^(1/(2*Df))
## house_size   1.192478  1        1.092006
## num_bedrooms 1.192990  1        1.092241
## city         1.000851  4        1.000106
## toll_access  1.000852  1        1.000426
## house_age    1.000245  1        1.000122

The results above show no indication of meaningful multicollinearity in the regression model: every GVIF value is close to 1, so the independent variables are only weakly correlated with one another.
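For a single-column predictor such as house_size, the GVIF reduces to the ordinary VIF, which is 1 / (1 - R^2) from regressing that predictor on all the others. The sketch below computes it by hand in base R on hypothetical toy predictors (`x1`, `x2`, `x3` and the helper `vif_by_hand` are illustrative names, not part of the assignment).

```r
# Hypothetical toy predictors: x2 is built to correlate with x1, x3 is independent.
set.seed(3)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.4 * x1 + rnorm(n)
x3 <- rnorm(n)

# VIF for one predictor: regress it on the others, then take 1 / (1 - R^2).
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_by_hand(x1, data.frame(x2, x3)),
  x2 = vif_by_hand(x2, data.frame(x1, x3)),
  x3 = vif_by_hand(x3, data.frame(x1, x2))
)
print(round(vifs, 3))
```

A value of 1 means the predictor is uncorrelated with the rest; the mild inflation for house_size and num_bedrooms above is expected, since num_bedrooms was simulated from house_size.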

# f. Perform diagnostic tests for heteroskedasticity using the Breusch-Pagan test and for linearity using the Rainbow test.
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# 1. Perform Breusch-Pagan test for heteroskedasticity
bp_test <- bptest(modell)
bp_test
## 
##  studentized Breusch-Pagan test
## 
## data:  modell
## BP = 3241, df = 8, p-value < 2.2e-16
# 2. Perform Rainbow test for linearity
rainbow_test <- raintest(modell, fraction = 0.7, order.by = NULL, center = NULL,
   data=list())
rainbow_test
## 
##  Rainbow test
## 
## data:  modell
## Rain = 1.031, df1 = 6001, df2 = 13993, p-value = 0.08014

In the Breusch-Pagan test, the p-value is < 2.2e-16, so the null hypothesis of constant variance is rejected: the regression model shows heteroskedasticity.
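This result is expected from the way price was simulated: the term rnorm(obss, mean=15, sd=5) * house_size gives each house its own random slope, so the spread around the fitted line grows in proportion to house_size. The sketch below reproduces that mechanism on hypothetical toy data and computes the studentized Breusch-Pagan statistic by hand in base R, as n times the R-squared of an auxiliary regression of squared residuals on the regressors. The names `x`, `y`, `fit`, and `aux` are illustrative.

```r
# Hypothetical toy data mimicking the simulation: the slope itself is random,
# so the error variance grows with x.
set.seed(4)
n <- 2000
x <- runif(n, 1, 10)
y <- 100 + rnorm(n, mean = 15, sd = 5) * x

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the same regressor.
aux <- lm(residuals(fit)^2 ~ x)

# Studentized statistic: n * R^2, chi-squared with one df per regressor.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_pval <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_pval))
```

A small p-value here, as in the bptest() output above, means the squared residuals are systematically related to the regressors.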

For the Rainbow test, the p-value (0.08014) is greater than 0.05, so H0 is not rejected: there is no evidence against a linear relationship between the independent and dependent variables.

# g. Create diagnostic plots to assess the model's assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.
# Create diagnostic plots

library(dplyr)
library(ggfortify)
par(mfrow = c(2, 2))
autoplot(modell)

  1. The Residuals vs Fitted plot shows no pattern, which suggests the relationship in the model is linear.
  2. The Normal Q-Q plot shows the residual points following the straight line, which suggests the residuals are approximately normally distributed.
  3. The Scale-Location plot appears to show residual variance spread evenly along the fitted values, which would indicate homoskedasticity; the Breusch-Pagan test above, however, rejects constant variance, so this visual impression should be treated with caution.
  4. The Cook's distance plot highlights three outliers: observations #2982, #6736, and #18637.