Econometrics

Midterm Exam

Contact	: \(\downarrow\)
Email	ali.19arifin@gmail.com
Instagram	https://www.instagram.com/arifin.alicia/
RPubs	https://rpubs.com/aliciaarifin/
Name	Alicia Arifin
Student ID	20214920001
Study program	Statistics, 2021

Note:
1. If the numbers in the code output differ from the written text, it is because the random sample was regenerated when the code was re-run.
2. The significance level used is 0.05, i.e. a 95% confidence level \[\alpha=0.05\]

Number 1

Analyze the relationship between a company’s advertising expenditure, its product price, future value, tax, interest rate, and its sales revenue. Follow the instruction below:

a

Generate hypothetical data for 100* observations.
Note: replace the * sign with the last two digits of your Student ID number

n=10001 # 100* with * replaced by the last two digits of the Student ID (01)
set.seed(123) # fix the random seed so the simulated sample is reproducible
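
Fixing the seed is what makes a simulated sample repeatable: the same seed followed by the same sequence of random calls gives the same numbers. A minimal sketch of that behaviour (the names draw_a, draw_b and draw_c are illustrative and not part of the exam code):

```r
# Same seed + same sequence of calls => identical random draws.
set.seed(123)
draw_a <- rnorm(5, mean = 212, sd = 32)

set.seed(123)
draw_b <- rnorm(5, mean = 212, sd = 32)

# Without resetting the seed, the next call continues the random stream.
draw_c <- rnorm(5, mean = 212, sd = 32)

identical(draw_a, draw_b)  # TRUE
identical(draw_a, draw_c)  # FALSE
```

Results only change between runs when the seed is not reset or when the order of the random calls changes.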

b

Create five independent variables: expenditure, its product price, future value, tax, and interest rate.

expenditure = rnorm(n=n, mean = 212, sd=32)
product_price = rnorm(n=n, mean= 350, sd=18)
future_value = product_price * rnorm(n, mean= 0.09, sd= 0.02)
tax = product_price * ifelse(product_price > 350, 0.15, 0.11)
interest_rate = rnorm(n=n, mean= 0.11,sd= 0.03)

c

Generate a dependent variable, sales revenue, using a linear relationship with the independent variables.

sales_revenue = (product_price * rnorm(n, mean= 0.50, sd=0.09) )

d

Fit a multiple regression model where dependent variables are regressed to the independent variables.

linear_regression1= lm(sales_revenue~ expenditure+product_price+future_value+tax+interest_rate)

e

Print a summary of the regression results, which includes coefficients, standard errors, t-statistics, p-values, and R-squared.

summary(linear_regression1)

## 
## Call:
## lm(formula = sales_revenue ~ expenditure + product_price + future_value + 
##     tax + interest_rate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.584  -21.265    0.176   21.020  115.333 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.288154  10.379032   0.220   0.8255    
## expenditure   -0.013233   0.009867  -1.341   0.1799    
## product_price  0.488115   0.037305  13.085   <2e-16 ***
## future_value   0.088480   0.045010   1.966   0.0494 *  
## tax            0.053597   0.074504   0.719   0.4719    
## interest_rate -0.614442  10.519755  -0.058   0.9534    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.53 on 9995 degrees of freedom
## Multiple R-squared:  0.08141,    Adjusted R-squared:  0.08095 
## F-statistic: 177.2 on 5 and 9995 DF,  p-value: < 2.2e-16

From the regression output above, the independent variables that are statistically significant at alpha 0.05 are product_price (p-value < 2e-16) and, marginally, future_value (p-value = 0.0494). The p-value of the overall F-test is below alpha, so the model as a whole is statistically significant. However, the Multiple R-squared is very small (0.0814), which means that less than 10% of the variation in the dependent variable (sales_revenue) is explained by the regression.
The estimated regression equation is

linear_regression1$coefficients

##   (Intercept)   expenditure product_price  future_value           tax 
##    2.28815437   -0.01323322    0.48811493    0.08848033    0.05359736 
## interest_rate 
##   -0.61444166

Sales Revenue = 2.288 - 0.013 expenditure + 0.488 product price + 0.088 future value + 0.054 tax - 0.614 interest rate

Note:
++ Whether a variable is significant can be read from the stars to the right of its p-value: *** means p-value < 0.001, ** means p-value < 0.01, * means p-value < 0.05, and . means p-value < 0.1.
++ The numbers change whenever the sample is regenerated, so the written text may not match the code output exactly.
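
The star codes can also be reproduced directly from the coefficient table instead of being read off the printout. The sketch below uses a small illustrative dataset (x1, x2 and y are assumptions, not the exam variables) and maps each p-value to the code that summary() prints:

```r
# Illustrative data (not the exam data): y depends on x1 only.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 1 + 2 * x1 + rnorm(200)

fit <- lm(y ~ x1 + x2)

# Coefficient table as a matrix; the p-values are in the "Pr(>|t|)" column.
coef_table <- coef(summary(fit))
p_values   <- coef_table[, "Pr(>|t|)"]

# Map each p-value to the star code used in the summary() legend.
stars <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

data.frame(p_value = signif(p_values, 3),
           stars = stars,
           significant_at_5pct = p_values < 0.05)
```

Working from the p-value column avoids misreading the legend, since the decision at alpha 0.05 is simply p_values < 0.05.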

f

Plot the residuals against the fitted values to check for heteroscedasticity (unequal variance) and nonlinearity.

plot(linear_regression1, which = 1, col="green")

In the residuals vs fitted plot, the smoothed line stays close to 0, so the relationship can be treated as linear. As for homogeneity of variance, the residuals show no systematic pattern in their spread, so the variance is assumed to be homogeneous.

g

Plot diagnostic plots to further assess the assumptions of linear regression, including normality of residuals, constant variance, and absence of influential outliers.

library(ggplot2)
library(ggfortify)
library(broom)
model1_diagram <- augment(linear_regression1)

par(mfrow=c(2,2))
autoplot(linear_regression1)

The residuals vs fitted values plot shows no pattern, so the relationship between the dependent variable and the independent variables can be treated as linear.

library(tseries)
jarque.bera.test(residuals(linear_regression1))

## 
##  Jarque Bera Test
## 
## data:  residuals(linear_regression1)
## X-squared = 0.64244, df = 2, p-value = 0.7253

The Normal Q-Q plot is used to judge whether the residuals of the linear regression are normally distributed: the closer the points lie to the straight line, the closer the residuals are to a normal distribution. Judging from the plot, the residuals are normally distributed. To confirm this, we can use the Jarque-Bera test. Its H0 is that the residuals are normally distributed, and H0 is rejected if \[ p-value < \alpha\]. The p-value of the Jarque-Bera test (0.7253) is greater than \[\alpha (0.05)\], so H0 is not rejected and the residuals of the linear regression can be treated as normally distributed.
The scale-location plot shows whether the residuals are spread evenly across the range of fitted values. In this plot the smoothed line rises as the fitted values increase, so the points are not split evenly around a horizontal line. This suggests that the variance of the residuals increases with the fitted values, i.e. the residual variance is not constant.
The residuals vs leverage plot shows outliers and leverage points. An observation is a possible outlier when its standardized residual is above 3 in absolute value. The plot contains observations with standardized residuals beyond 3, so these are outliers, and there are quite a few of them.
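
The Jarque-Bera statistic used above can be computed by hand from the skewness and kurtosis of the residuals, which makes clear what the test is measuring. This is a sketch on illustrative data; jb_by_hand, fit_normal and fit_skewed are hypothetical names, and the formula is the standard one, JB = n/6 * (S^2 + (K - 3)^2 / 4):

```r
# JB = n/6 * (S^2 + (K - 3)^2 / 4), with S the sample skewness and
# K the sample kurtosis; under H0 (normality) JB is approximately
# chi-squared with 2 degrees of freedom.
jb_by_hand <- function(res) {
  n  <- length(res)
  d  <- res - mean(res)
  m2 <- mean(d^2)
  S  <- mean(d^3) / m2^1.5
  K  <- mean(d^4) / m2^2
  stat <- n / 6 * (S^2 + (K - 3)^2 / 4)
  c(statistic = stat, p_value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(5)
x        <- rnorm(1000)
y_normal <- 1 + 2 * x + rnorm(1000)  # normal errors
y_skewed <- 1 + 2 * x + rexp(1000)   # right-skewed errors

fit_normal <- lm(y_normal ~ x)
fit_skewed <- lm(y_skewed ~ x)

jb_by_hand(residuals(fit_normal))
jb_by_hand(residuals(fit_skewed))
```

A large statistic means the residuals are too skewed or too heavy-tailed to be normal; with skewed errors the p-value falls far below 0.05.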

par(mfrow= c(1,2))
plot(linear_regression1,4)
plot(linear_regression1,5)

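In the Cook's distance plot, three observations are labelled as the most influential: 57720, 7283 and 7725. These outliers are still within a safe range. Note that \[2(p+1)/n\], with p the number of predictors and n the number of observations, is the usual rule-of-thumb cutoff for leverage; for Cook's distance, common rules of thumb are 4/n or 1.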
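
Instead of reading influential points off the plot, they can be listed numerically. The sketch below assumes the common rules of thumb 4/n for Cook's distance, 2(p+1)/n for leverage and 3 for the absolute standardized residual; these cutoffs and the planted data point are assumptions of the example, not part of the exam code:

```r
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
# Plant one unusual point: far out in x and far off the regression line.
x[n] <- 6
y[n] <- -10

fit <- lm(y ~ x)
p   <- length(coef(fit)) - 1   # number of predictors

cook <- cooks.distance(fit)
lev  <- hatvalues(fit)
rstd <- rstandard(fit)

flagged <- data.frame(
  obs           = seq_len(n),
  cooks_d       = cook,
  leverage      = lev,
  std_resid     = rstd,
  high_cook     = cook > 4 / n,
  high_leverage = lev > 2 * (p + 1) / n,
  outlier       = abs(rstd) > 3
)

# Show only the observations that trip at least one rule.
flagged[flagged$high_cook | flagged$high_leverage | flagged$outlier, ]
```

The planted observation is flagged on both Cook's distance and leverage, which is the combination that actually moves the fitted line.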

Number 2

Investigate the factors influencing housing prices as the following instructions:

a

Simulate a hypothetical dataset with 200* observations containing variables such as house size, number of bedrooms, city (five cities), toll access (yes or no), age of the house, and price.

n2 = 20001
house_size =  rnorm(n2, mean=200, sd=50)
bedroom = ifelse(house_size <120, round(rnorm(n2, mean=1, sd=1)), round( rnorm(n2,mean = 3, sd=1)))
city = c(rep(times =4000, "New York"), rep(times =4000, "Chicago"),rep(times =4000, "Atlanta"),rep(times =4000, "Los Angeles"),rep(times =4001, "Denver") )
city2 = ifelse(city=="New York",1,
               ifelse(city=="Chicago",2,
                      ifelse(city=="Atlanta",3,
                             ifelse(city=="Los Angeles",4,5))))
toll_access = rbinom(n2, size=1,prob=0.5)
toll_access1 = toll_access
toll_access = ifelse(toll_access ==1, "yes","no")
house_age = round(rnorm(n2, mean =10, sd=5))
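# Note: toll_access holds "yes"/"no" at this point, so toll_access==1 is always FALSE
# and every house gets the rnorm(mean=4500) term; toll_access1 is the 0/1 version.
price = rnorm(n2, 30000, sd=6000) +
  bedroom * rnorm(n2, mean=2000, sd=500) +
  ifelse(toll_access==1, rnorm(n2, mean=10000, sd=2000), rnorm(n2, mean=4500, sd=500)) +
  house_age * ifelse(house_age <6, rnorm(n2, 5000, sd=500), rnorm(n2, mean= 2200, sd=90))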

b

Fit a multiple regression model using the lm() function, where the price of the house is the dependent variable, and house size, number of bedrooms, city, and age are the independent variables.

linear_model2 = lm(price~ house_size+bedroom+city2+toll_access1+house_age)

c

Convert the “city” and “toll access” variable to a factor to treat it as a categorical variable.

city = as.factor(city2)
toll_access = as.factor(toll_access)

In this analysis we compare which model is better: the one where city and toll_access are converted to factors, or the one where they are not.

d

Summarize the fitted regression model to analyze the coefficients, standard errors, t-values, and p-values.
To summarize the linear regression we use summary().

# when city and toll_access are treated as factors
summary(lm(price~ house_size+bedroom+city+toll_access+house_age)) #Regression1

## 
## Call:
## lm(formula = price ~ house_size + bedroom + city + toll_access + 
##     house_age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40093  -4894   -179   4687  33585 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    39048.7478   282.7665 138.095   <2e-16 ***
## house_size        -0.4028     1.0679  -0.377    0.706    
## bedroom         2012.2068    47.2379  42.597   <2e-16 ***
## city2             15.6066   165.6290   0.094    0.925    
## city3           -141.8749   165.6381  -0.857    0.392    
## city4            -53.1471   165.6486  -0.321    0.748    
## city5           -134.3457   165.6267  -0.811    0.417    
## toll_accessyes  -156.1340   104.7590  -1.490    0.136    
## house_age       1900.1723    10.3806 183.051   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7407 on 19992 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6397 
## F-statistic:  4440 on 8 and 19992 DF,  p-value: < 2.2e-16

The summary of the multiple linear regression with factor variables shows:
1. The p-value of the F-test is below alpha 0.05, so the regression is significant overall.
2. The Multiple R-squared is 0.6398, which means that 63.98% of the variation in price is explained by the model.
3. The significant independent variables are bedroom (number of bedrooms) and house_age (age of the house).
Below we refit the regression using only the independent variables that have a significant effect.

summary(lm(price~ bedroom+house_age)) # Regression2

## 
## Call:
## lm(formula = price ~ bedroom + house_age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40245  -4887   -187   4671  33709 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38837.46     176.73  219.76   <2e-16 ***
## bedroom      2008.11      46.26   43.41   <2e-16 ***
## house_age    1900.27      10.38  183.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7407 on 19998 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6397 
## F-statistic: 1.776e+04 on 2 and 19998 DF,  p-value: < 2.2e-16

The summary of the reduced multiple linear regression shows:
1. The p-value of the F-test is below alpha 0.05, so the regression is significant overall.
2. The Multiple R-squared is 0.6398, which means that 63.98% of the variation in price is explained by the model.
To four decimal places, regression1 and regression2 have the same Multiple R-squared (0.6398) and Adjusted R-squared (0.6397), so dropping the insignificant predictors costs essentially nothing in fit. The difference is negligible, and for this analysis we keep regression1.

lm(price~ house_size+bedroom+city+toll_access+house_age)$coefficients

##    (Intercept)     house_size        bedroom          city2          city3 
##  39048.7477866     -0.4028098   2012.2068338     15.6065632   -141.8748546 
##          city4          city5 toll_accessyes      house_age 
##    -53.1470885   -134.3457151   -156.1340064   1900.1723398

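The regression equation is therefore: \[\text{price} = 39048.75 - 0.403\,\text{house\_size} + 2012.21\,\text{bedroom} + 15.61\,\text{city2} - 141.87\,\text{city3} - 53.15\,\text{city4} - 134.35\,\text{city5} - 156.13\,\text{toll\_accessyes} + 1900.17\,\text{house\_age}\] Example: the city terms are dummy variables. For a house in city 2, city2 = 1 and city3 to city5 are 0; city 1 (New York) is the reference level, for which all four dummies are 0.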
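
The way a factor turns into 0/1 columns can be inspected with model.matrix(), which builds the same design matrix that lm() uses. A minimal sketch with a hypothetical five-level factor city_demo, coded 1 to 5 like city2:

```r
# A factor with five levels, coded 1..5 as in city2.
city_demo <- factor(c(1, 2, 3, 4, 5))

# model.matrix() shows the columns lm() builds for a factor:
# an intercept plus one 0/1 column per non-reference level.
dummies <- model.matrix(~ city_demo)
dummies
```

The first row (level 1, the reference) has zeros in every dummy column, so its effect is absorbed into the intercept, and each remaining coefficient is a difference from that reference level.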

Next, we try the model in which all variables are numeric.

# when city and toll_access are not treated as factors (numeric)
summary(linear_model2) # Regression3

## 
## Call:
## lm(formula = price ~ house_size + bedroom + city2 + toll_access1 + 
##     house_age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40172  -4897   -176   4673  33629 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39086.471    284.800 137.242   <2e-16 ***
## house_size      -0.397      1.068  -0.372    0.710    
## bedroom       2012.266     47.234  42.602   <2e-16 ***
## city2          -33.745     37.036  -0.911    0.362    
## toll_access1  -156.596    104.750  -1.495    0.135    
## house_age     1900.138     10.379 183.070   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7407 on 19995 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6397 
## F-statistic:  7104 on 5 and 19995 DF,  p-value: < 2.2e-16

From the regression summary above, the independent variables with a significant effect (p-value < 0.05) are bedroom (number of bedrooms) and house_age (age of the house). toll_access1 (toll road access) has the next smallest p-value (0.135) but is not significant at alpha 0.05. The p-value of the F-test is below 0.05, so the model is statistically significant overall. The Multiple R-squared is 0.6398, which means that 63.98% of the variation in price is explained by the regression model.
Now that we know which independent variables matter, we fit a smaller linear model that keeps bedroom and house_age and retains toll_access as a check: price~bedroom+toll_access+house_age

summary(lm(price~bedroom+toll_access+house_age)) # Regression4

## 
## Call:
## lm(formula = price ~ bedroom + toll_access + house_age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40168  -4893   -177   4680  33629 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    38917.21     184.56 210.864   <2e-16 ***
## bedroom         2008.13      46.26  43.411   <2e-16 ***
## toll_accessyes  -157.00     104.75  -1.499    0.134    
## house_age       1900.21      10.38 183.086   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7406 on 19997 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6398 
## F-statistic: 1.184e+04 on 3 and 19997 DF,  p-value: < 2.2e-16

In this model the p-value of the F-test is below 0.05. Is it better than the previous model? Not necessarily: the Multiple R-squared is the same to four decimals (0.6398), and the Adjusted R-squared differs by only 0.0001 (0.6398 against 0.6397). Because Multiple R-squared never decreases when predictors are added, models with different numbers of predictors are better compared using Adjusted R-squared.
Note:
++ Multiple R-squared is the plain share of variance explained; Adjusted R-squared penalizes the number of predictors and is the one to use when comparing multiple regression models of different sizes.
++ The significance level used is 0.05, i.e. a 95% confidence level \[\alpha=0.05\]
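
The difference between the two R-squared measures is easiest to see by adding a predictor that is pure noise. The sketch below uses illustrative data (x1, noise, y and the helper adj_r2 are assumptions of the example) and recomputes Adjusted R-squared with the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1):

```r
set.seed(3)
n     <- 200
x1    <- rnorm(n)
noise <- rnorm(n)             # unrelated to y by construction
y     <- 1 + 2 * x1 + rnorm(n)

small <- summary(lm(y ~ x1))
large <- summary(lm(y ~ x1 + noise))

# Adjusted R-squared by hand: 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2_small = small$r.squared, r2_large = large$r.squared)
c(adj_small = small$adj.r.squared, adj_large = large$adj.r.squared)
adj_r2(large$r.squared, n, p = 2)   # agrees with large$adj.r.squared
```

Multiple R-squared of the larger model can never be below that of the smaller one, which is why it cannot be used on its own to choose between nested models.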

linear_model2$coefficients

##   (Intercept)    house_size       bedroom         city2  toll_access1 
## 39086.4708335    -0.3970013  2012.2657049   -33.7452503  -156.5958387 
##     house_age 
##  1900.1381894

The regression equation is therefore: \[\text{price} = 39086.47 - 0.397\,\text{house\_size} + 2012.27\,\text{bedroom} - 33.75\,\text{city2} - 156.60\,\text{toll\_access1} + 1900.14\,\text{house\_age}\]

Which model should be used?

The two approaches (categorical variables converted to factors, and not converted) gave regression1 (with factor variables) and regression3 (all variables numeric).

# when city and toll_access are treated as factors
summary(lm(price~ house_size+bedroom+city+toll_access+house_age)) #Regression1

## 
## Call:
## lm(formula = price ~ house_size + bedroom + city + toll_access + 
##     house_age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40093  -4894   -179   4687  33585 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    39048.7478   282.7665 138.095   <2e-16 ***
## house_size        -0.4028     1.0679  -0.377    0.706    
## bedroom         2012.2068    47.2379  42.597   <2e-16 ***
## city2             15.6066   165.6290   0.094    0.925    
## city3           -141.8749   165.6381  -0.857    0.392    
## city4            -53.1471   165.6486  -0.321    0.748    
## city5           -134.3457   165.6267  -0.811    0.417    
## toll_accessyes  -156.1340   104.7590  -1.490    0.136    
## house_age       1900.1723    10.3806 183.051   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7407 on 19992 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6397 
## F-statistic:  4440 on 8 and 19992 DF,  p-value: < 2.2e-16

# when city and toll_access are not treated as factors (numeric)
summary(linear_model2) # Regression3

## 
## Call:
## lm(formula = price ~ house_size + bedroom + city2 + toll_access1 + 
##     house_age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40172  -4897   -176   4673  33629 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39086.471    284.800 137.242   <2e-16 ***
## house_size      -0.397      1.068  -0.372    0.710    
## bedroom       2012.266     47.234  42.602   <2e-16 ***
## city2          -33.745     37.036  -0.911    0.362    
## toll_access1  -156.596    104.750  -1.495    0.135    
## house_age     1900.138     10.379 183.070   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7407 on 19995 degrees of freedom
## Multiple R-squared:  0.6398, Adjusted R-squared:  0.6397 
## F-statistic:  7104 on 5 and 19995 DF,  p-value: < 2.2e-16

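CONCLUSION:
1. Both regressions are significant overall (F-test p-value < 0.05).
2. The two models fit equally well: both have Multiple R-squared 0.6398 and Adjusted R-squared 0.6397. We use regression3, which reaches the same fit with fewer coefficients (5 predictors against 8). Keep in mind that regression3 treats the city code as a numeric score, which imposes an ordering on the cities.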

e

Check for multicollinearity using the Variance Inflation Factor (VIF) to assess the correlation between independent variables.
Multicollinearity can be detected with the variance inflation factor (VIF). The VIF of a predictor measures how strongly it is linearly related to the other independent variables. VIF values start at 1 and have no upper bound.
++ VIF = 1: no correlation with the other independent variables
++ VIF between 1 and 5: weak to moderate correlation with the other independent variables, usually not a concern
++ VIF above 5: strong correlation with the other independent variables

library(car)
vif(linear_model2)

##   house_size      bedroom        city2 toll_access1    house_age 
##     1.042374     1.042618     1.000255     1.000038     1.000118

From the VIF results above, all VIF values are very close to 1 (the largest is about 1.04), so the independent variables are only negligibly correlated and there is no multicollinearity among them.
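
For numeric predictors, the VIF can be reproduced in base R from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the others. A sketch on illustrative data (x1, x2, x3 and vif_by_hand are assumptions of the example, with x2 built to be strongly correlated with x1):

```r
set.seed(4)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                               # unrelated to x1 and x2
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2): regress each predictor on the remaining ones.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
```

Here x1 and x2 get clearly inflated values while x3 stays near 1, which is the pattern to look for; in the housing model every predictor behaves like x3.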

f

Perform diagnostic tests for heteroskedasticity using the Breusch-Pagan test and for linearity using the Rainbow test.

Heteroskedasticity

What is heteroskedasticity? Heteroskedasticity is the condition in which the variance (and hence the standard deviation) of the residuals is not constant. The desired outcome is the absence of heteroskedasticity.
The Breusch-Pagan test is run with the lmtest library and the function bptest().

library(lmtest)
bptest(linear_model2)

## 
##  studentized Breusch-Pagan test
## 
## data:  linear_model2
## BP = 1016.5, df = 5, p-value < 2.2e-16

The Breusch-Pagan test gives BP = 1016.5 and p-value < \[\alpha=0.05\].
H0 is rejected: the residuals are heteroskedastic, so the standard errors and p-values of the regression should be interpreted with caution.
Note:
1. p-value > alpha: H0 is not rejected, H0 = the residual variance is homogeneous
2. p-value < alpha: H0 is rejected, H1 = the residuals are heteroskedastic
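
The studentized Breusch-Pagan statistic can be sketched in base R as n times the R-squared of an auxiliary regression of the squared residuals on the predictors, compared with a chi-squared distribution whose degrees of freedom equal the number of predictors. The data below are illustrative, with the error standard deviation made to grow with x on purpose:

```r
set.seed(6)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error sd grows with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictors.
sq_res  <- residuals(fit)^2
aux     <- lm(sq_res ~ x)

bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

A small p-value means the squared residuals are predictable from the regressors, which is exactly what non-constant variance looks like.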

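# Log-transform the response and house_size to try to stabilize the variance.
# The logs of the other predictors are not used in the model
# (bedroom, toll_access1 and house_age contain zeros).
y_transform = log(price)
rumah = log(house_size)

# Pass the fitted lm object itself (not its summary) to bptest()
linear_transform = lm(y_transform~rumah+ bedroom+city2+toll_access1+house_age)
bptest(linear_transform)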

## 
##  studentized Breusch-Pagan test
## 
## data:  linear_transform
## BP = 561.98, df = 5, p-value < 2.2e-16

After the transformation the residuals are still heteroskedastic, so a weighted regression (weighted least squares) would be the appropriate next step. Because this question asks for an ordinary linear regression, we set this aside for now.
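
For reference, a weighted regression can be fitted with the weights argument of lm(). The sketch below is not applied to the housing data: it uses illustrative data in which the error variance is known to be proportional to x^2, so the weights 1/x^2 are an assumption that holds by construction here and would have to be justified or estimated in a real analysis.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)    # Var(error) proportional to x^2

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)  # weights = 1 / assumed error variance

# Compare the slope estimate and its standard error under both fits.
rbind(OLS = coef(summary(ols))["x", c("Estimate", "Std. Error")],
      WLS = coef(summary(wls))["x", c("Estimate", "Std. Error")])
```

Both fits estimate the same slope; the weighted fit gives less influence to the noisy observations at large x.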

Linearity

raintest(linear_model2)

## 
##  Rainbow test
## 
## data:  linear_model2
## Rain = 0.98329, df1 = 10001, df2 = 9994, p-value = 0.8003

H0 = linear, H1 = nonlinear
p-value > 0.05, so H0 is not rejected: the model is linear and suitable for linear regression.

g

Create diagnostic plots to assess the model’s assumptions, including residual plots against fitted values, Q-Q plots of residuals, and plots of residuals against leverage.

par(mfrow=c(2,2))
autoplot(linear_model2)

The residuals vs fitted values plot shows no pattern. The smoothed line stays around 0, and the Rainbow test also indicates linearity, so the relationship between the dependent variable and the independent variables can be treated as linear.

library(tseries)
jarque.bera.test(residuals(linear_model2))

## 
##  Jarque Bera Test
## 
## data:  residuals(linear_model2)
## X-squared = 232.52, df = 2, p-value < 2.2e-16

The Normal Q-Q plot shows whether the residuals are normally distributed. Judging from the plot alone, the residuals look normally distributed. The Jarque-Bera test can be used to check normality formally. Its H0 is that the residuals are normally distributed, and H0 is rejected if \[ p-value < \alpha\]. The p-value of the Jarque-Bera test is smaller than \[\alpha (0.05)\], so we conclude that the residuals of the linear regression are not normally distributed.
The scale-location plot shows whether the residuals are spread evenly across the range of fitted values. In this plot the points are not split evenly around the line. This suggests that the variance of the residuals increases with the fitted values, i.e. the residual variance is not constant.
The residuals vs leverage plot shows outliers and leverage points. An observation is a possible outlier when its standardized residual is above 3 in absolute value. The plot contains observations with standardized residuals beyond 3, so these are outliers.

par(mfrow=c(1,2))
plot(linear_model2, 4)
plot(linear_model2,5)

In the Cook's distance plot, three observations are labelled as the most influential: 859q, 14268 and 16974. These outliers are still within a safe range.