STA1512 - Correlation and Regression Analysis

Definitions

Correlation analysis measures the strength of the linear relationship between two variables.
Regression analysis is a statistical tool for evaluating the relationship between one variable and one other variable, or between one variable and several other variables.

Correlation

The correlation coefficient takes values between -1 and 1.
A positive sign (+) indicates that the two variables move in the same direction, whereas a negative sign (-) indicates that they move in opposite directions.

Example

library(readxl)
library(lmtest)
datacontoh<- read_excel("D:/MATERI KULIAH S2 IPB/SEMESTER 1/ANALISIS STATISTIKA/dataregresi.xlsx")
datacontoh

## # A tibble: 20 x 5
##    Subject Triceps_X1 Thigh_X2 Midarm_X3 Bodyfat_Y
##      <dbl>      <dbl>    <dbl>     <dbl>     <dbl>
##  1       1       19.5     43.1      29.1      11.9
##  2       2       24.7     49.8      28.2      22.8
##  3       3       30.7     51.9      37        18.7
##  4       4       29.8     54.3      31.1      20.1
##  5       5       19.1     42.2      30.9      12.9
##  6       6       25.6     53.9      23.7      21.7
##  7       7       31.4     58.5      27.6      27.1
##  8       8       27.9     52.1      30.6      25.4
##  9       9       22.1     49.9      23.2      21.3
## 10      10       25.5     53.5      24.8      19.3
## 11      11       31.1     56.6      30        25.4
## 12      12       30.4     56.7      28.3      27.2
## 13      13       18.7     46.5      23        11.7
## 14      14       19.7     44.2      28.6      17.8
## 15      15       14.6     42.7      21.3      12.8
## 16      16       29.5     54.4      30.1      23.9
## 17      17       27.7     55.3      25.7      22.6
## 18      18       30.2     58.6      24.6      25.4
## 19      19       22.7     48.2      27.1      14.8
## 20      20       25.2     51        27.5      21.1

where

X1: Triceps Skinfold Thickness

X2: Thigh Circumference

X3: Midarm Circumference

Y: Body Fat

The data can be downloaded from the following link: (Click here)

# compute the Pearson correlation between Triceps_X1 and Bodyfat_Y
korelasi_y_x1 <- cor(datacontoh$Triceps_X1, datacontoh$Bodyfat_Y, method = "pearson")
korelasi_y_x1

## [1] 0.8432654
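
The value returned by `cor()` can be checked against the definition $r = S_{xy}/\sqrt{S_{xx}S_{yy}}$. The sketch below is only an illustration: it re-enters the `Triceps_X1` and `Bodyfat_Y` columns from the printed table by hand so that it runs without the Excel file. The vector names `triceps` and `bodyfat` are chosen here and do not appear in the original script.

```r
# Triceps_X1 and Bodyfat_Y, copied from the printed tibble (assumed names)
triceps <- c(19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4, 27.9, 22.1, 25.5,
             31.1, 30.4, 18.7, 19.7, 14.6, 29.5, 27.7, 30.2, 22.7, 25.2)
bodyfat <- c(11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
             25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1)

# Sums of squares and cross-products around the means
s_xy <- sum((triceps - mean(triceps)) * (bodyfat - mean(bodyfat)))
s_xx <- sum((triceps - mean(triceps))^2)
s_yy <- sum((bodyfat - mean(bodyfat))^2)

# Pearson correlation computed from its definition
r_manual <- s_xy / sqrt(s_xx * s_yy)
r_manual

# Compare with the built-in function
all.equal(r_manual, cor(triceps, bodyfat, method = "pearson"))
```

Because $|S_{xy}| \leq \sqrt{S_{xx}S_{yy}}$, the ratio always falls between -1 and 1, and its sign is the sign of $S_{xy}$.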

# Pearson correlation matrix (the Subject column is dropped)
matriks <- cor(datacontoh[, -1])
round(matriks, 2)

##            Triceps_X1 Thigh_X2 Midarm_X3 Bodyfat_Y
## Triceps_X1       1.00     0.92      0.46      0.84
## Thigh_X2         0.92     1.00      0.08      0.88
## Midarm_X3        0.46     0.08      1.00      0.14
## Bodyfat_Y        0.84     0.88      0.14      1.00

# draw a pairwise correlation plot
library(ggplot2)
library(GGally)
ggpairs(datacontoh[, -1])

Simple Linear Regression Model

The estimated simple linear regression line is:

$\hat{y}_{i}=b_{0}+b_{1}x_{i}$

where

$\hat{y}_{i}$ : the fitted (predicted) value of y (the response variable) for the i-th observation

$x_{i}$ : the value of x (the explanatory variable) for the i-th observation

$b_{0}$ : the estimate of the intercept ($\beta_{0}$)

$b_{1}$ : the estimate of the slope ($\beta_{1}$)

Interpretation of the regression coefficients:

The coefficient $b_{0}$ is the estimated mean of y when x equals zero.

The coefficient $b_{1}$ is the estimated change in the mean of y when x increases by one unit (a positive $b_{1}$ indicates an increase, a negative $b_{1}$ a decrease).
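
The estimates are obtained by least squares, which gives $b_{1}=S_{xy}/S_{xx}$ and $b_{0}=\bar{y}-b_{1}\bar{x}$. A minimal sketch of that calculation follows, assuming the body-fat data from the correlation example with `Triceps_X1` as x and `Bodyfat_Y` as y. This pairing and the names `x`, `y`, and `fit` are chosen here for illustration only.

```r
# Triceps_X1 (x) and Bodyfat_Y (y), copied from the correlation example
x <- c(19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4, 27.9, 22.1, 25.5,
       31.1, 30.4, 18.7, 19.7, 14.6, 29.5, 27.7, 30.2, 22.7, 25.2)
y <- c(11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
       25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1)

# Slope: cross-product sum divided by the sum of squares of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

# lm() should return the same two numbers
fit <- lm(y ~ x)
coef(fit)
```

The intercept formula shows why $b_{0}$ is only meaningful when x = 0 lies within, or close to, the observed range of x.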

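Testing the Regression Parameters

t Test (Partial Test)

Hypotheses:

$H_{0}:\beta_{1}=0$

$H_{1}:\beta_{1} \neq 0$

Test statistic:

$t=\frac{b_{1}}{s_{b_{1}}}$

with

$s_{b_{1}}=\frac{s_{\varepsilon}}{\sqrt{S_{xx}}}$, $s_{\varepsilon}=\sqrt{\frac{SSE}{n-2}}$, and $S_{xx}=\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}$, where SSE is the residual sum of squares

Rejection region for H0:

Reject H0 if $|t|>t_{\alpha/2 ; n-2}$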

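F Test (Overall Test)

Hypotheses:

$H_{0}:\beta_{1}=0$

$H_{1}:$ At least one $\beta_{i} \neq 0$ (with a single explanatory variable, this is $\beta_{1} \neq 0$)

Test statistic:

$F=\frac{MS(Regression)}{MS(Residual)}=\frac{SS(Regression)/1}{SS(Residual)/(n-2)}$

with SS(Residual)=SSE

Rejection region for H0:

Reject H0 if $F>F_{df1;df2;\alpha}$, where df1 = 1 and df2 = n - 2 in simple linear regression.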
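
As an illustration of the F test, the sketch below builds F = MS(Regression)/MS(Residual) from the sums of squares and compares it with `anova()`. It assumes the body-fat data from the correlation example, with `Triceps_X1` as the single explanatory variable and $\alpha = 0.05$. Both choices, and the variable names, are made here for illustration and are not part of the original example.

```r
# Triceps_X1 and Bodyfat_Y, copied from the correlation example
triceps <- c(19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4, 27.9, 22.1, 25.5,
             31.1, 30.4, 18.7, 19.7, 14.6, 29.5, 27.7, 30.2, 22.7, 25.2)
bodyfat <- c(11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
             25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1)

fit <- lm(bodyfat ~ triceps)
n   <- length(bodyfat)

# Decompose the total variation: SST = SS(Regression) + SSE
sst <- sum((bodyfat - mean(bodyfat))^2)
sse <- sum(resid(fit)^2)
ssr <- sst - sse

# F statistic with df1 = 1 and df2 = n - 2
f_stat <- (ssr / 1) / (sse / (n - 2))
f_crit <- qf(1 - 0.05, df1 = 1, df2 = n - 2)
c(F = f_stat, critical = f_crit)

# The ANOVA table reports the same F value
anova(fit)

# With one explanatory variable, F equals the square of the slope's t statistic
t_slope <- summary(fit)$coefficients["triceps", "t value"]
all.equal(t_slope^2, f_stat)
```

The last line is why the t test and the F test always reach the same decision in simple linear regression.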

Coefficient of Determination ($R^{2}$)

The coefficient of determination measures the proportion of the variation in the response (dependent) variable that is explained by the explanatory (independent) variables in the regression model.

The value of $R^{2}$ lies in the range:

$0\leq R^{2} \leq 1$

Assumptions

Formal assumptions of regression analysis:

The relation is, in fact, linear, so that the errors all have expected value zero: $E(\varepsilon_{i})=0$ for all i.
The errors all have the same variance: $Var(\varepsilon_{i})=\sigma^{2}$ for all i.
The errors are independent of each other.
The errors are all normally distributed; $\varepsilon_{i}$ is normally distributed for all i.
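
These assumptions are usually examined through the residuals of a fitted model. The sketch below uses base R only and assumes a model of `Bodyfat_Y` on `Triceps_X1` fitted to the body-fat data from the correlation example. The model and the variable names are chosen here for illustration, and the checks shown are one common option rather than a prescribed procedure.

```r
# Triceps_X1 and Bodyfat_Y, copied from the correlation example
triceps <- c(19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4, 27.9, 22.1, 25.5,
             31.1, 30.4, 18.7, 19.7, 14.6, 29.5, 27.7, 30.2, 22.7, 25.2)
bodyfat <- c(11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
             25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1)

fit <- lm(bodyfat ~ triceps)
res <- resid(fit)

# Least-squares residuals average to zero when the model has an intercept,
# so this alone does not verify linearity
mean(res)

# Linearity and constant variance: look for curvature or a funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points should lie close to the reference line
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality (H0: residuals are normally distributed)
sw <- shapiro.test(res)
sw
```

Independence depends mainly on how the data were collected. The `lmtest` package loaded in the earlier example provides `bptest()` for constant variance and `dwtest()` for serial correlation, should formal tests be wanted.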

Example

Data from a sample of 10 pharmacies are used to examine the relationship between prescription sales volume and the percentage of prescription ingredients purchased directly from the supplier. The sample data are shown in the table below.

The data can be downloaded from the following link: [Link to download]

Questions:

Find the estimated simple linear regression line: $\hat{y}_{i}=b_{0}+b_{1}x_{i}$
Predict the sales volume of a pharmacy that purchases 15% of its prescription ingredients directly from the supplier.
Interpret $b_{1}$.
Find the coefficient of determination ($R^{2}$).
Is there a significant linear relationship between y (prescription sales volume) and x (the percentage of prescription ingredients purchased directly from the supplier)?

Answers:

Using R:

library(readxl)
datareg1<-read_excel("D:/MATERI KULIAH S2 IPB/ASPRAK 2/REGRESSION.xlsx")
datareg1

## # A tibble: 10 x 2
##    sales_volume_Y Ingredients_X
##             <dbl>         <dbl>
##  1             25            10
##  2             55            18
##  3             50            25
##  4             75            40
##  5            110            50
##  6            138            63
##  7             90            42
##  8             60            30
##  9             10             5
## 10            100            55

model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)
summary(model_reg)

## 
## Call:
## lm(formula = sales_volume_Y ~ Ingredients_X, data = datareg1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.074  -4.403  -1.607   5.719  14.834 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.6979     5.9520   0.789    0.453    
## Ingredients_X   1.9705     0.1545  12.750 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.022 on 8 degrees of freedom
## Multiple R-squared:  0.9531, Adjusted R-squared:  0.9472 
## F-statistic: 162.6 on 1 and 8 DF,  p-value: 1.349e-06

a. Estimated simple linear regression line

$\hat{y}=4.6979+1.9705x$

b. Prediction at x = 15%

$\hat{y}=4.70+1.97(15)=34.25$, that is, \$34,250 (sales volume is recorded in thousands of dollars).
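
The same prediction can be obtained with `predict()`, which uses the unrounded coefficients and therefore gives roughly 34.255 rather than 34.25. The sketch below re-enters the ten observations from the printed `datareg1` table so that it runs without the Excel file. The name `new_x` is introduced here for illustration.

```r
# The ten pharmacies, copied from the printed tibble
datareg1 <- data.frame(
  sales_volume_Y = c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100),
  Ingredients_X  = c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)
)

model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)

# New observation: 15% of ingredients bought directly from the supplier
new_x <- data.frame(Ingredients_X = 15)

# Point prediction of sales volume
pred <- predict(model_reg, newdata = new_x)
pred
```

The column name in `new_x` must match the explanatory variable in the model formula, otherwise `predict()` cannot find the value to plug in.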

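c. Interpretation

For each 1 percentage point increase in the share of prescription ingredients purchased directly from the supplier (x), the pharmacy's prescription sales volume (y) is estimated to increase by 1.97 units on average, that is, about \$1,970, given that y is recorded in thousands of dollars as in part b.

d. Coefficient of determination

$R^{2}=0.9531$: 95.31% of the variation in prescription sales volume (y) is explained by the percentage of prescription ingredients purchased directly from the supplier (x).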
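
The reported $R^{2}$ can be reproduced as $1 - SSE/SST$. The following sketch assumes the ten observations from the printed `datareg1` table. The names `sse`, `sst`, and `r2` are introduced here for illustration.

```r
# The ten pharmacies, copied from the printed tibble
datareg1 <- data.frame(
  sales_volume_Y = c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100),
  Ingredients_X  = c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)
)
model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)

# Unexplained variation (residual sum of squares)
sse <- sum(resid(model_reg)^2)

# Total variation of y around its mean
sst <- sum((datareg1$sales_volume_Y - mean(datareg1$sales_volume_Y))^2)

# Proportion of variation explained by the model
r2 <- 1 - sse / sst
r2

# In simple linear regression this equals the squared Pearson correlation
all.equal(r2, cor(datareg1$sales_volume_Y, datareg1$Ingredients_X)^2)
```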

e. t test for $\beta_{1}$

I. Hypotheses:

$H_{0}:\beta_{1}=0$ (there is no linear relationship between X and Y)

$H_{1}:\beta_{1} \neq 0$ (there is a linear relationship between X and Y)

II. Significance level: 5%

III. Test statistic

summary(model_reg)

## 
## Call:
## lm(formula = sales_volume_Y ~ Ingredients_X, data = datareg1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.074  -4.403  -1.607   5.719  14.834 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.6979     5.9520   0.789    0.453    
## Ingredients_X   1.9705     0.1545  12.750 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.022 on 8 degrees of freedom
## Multiple R-squared:  0.9531, Adjusted R-squared:  0.9472 
## F-statistic: 162.6 on 1 and 8 DF,  p-value: 1.349e-06

IV. Critical region:

Reject H0 if $|t|>t_{0.025; 8}=2.306$

V. Conclusion

Because $|t|=12.750 >t_{0.025; 8}=2.306$, H0 is rejected. We conclude that there is a significant linear relationship between X and Y at the 5% significance level.
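
Steps III to V can also be carried out by hand in R, which makes clear where the t value and the critical value come from. The sketch below assumes the ten observations from the printed `datareg1` table. The intermediate names (`s_e`, `se_b1`, `t_stat`, `t_crit`, `p_value`) are introduced here for illustration.

```r
# The ten pharmacies, copied from the printed tibble
datareg1 <- data.frame(
  sales_volume_Y = c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100),
  Ingredients_X  = c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)
)
model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)

n  <- nrow(datareg1)
b1 <- unname(coef(model_reg)["Ingredients_X"])

# Residual standard error: sqrt(SSE / (n - 2))
s_e <- sqrt(sum(resid(model_reg)^2) / (n - 2))

# Standard error of the slope: s_e / sqrt(Sxx)
s_xx  <- sum((datareg1$Ingredients_X - mean(datareg1$Ingredients_X))^2)
se_b1 <- s_e / sqrt(s_xx)

# Test statistic and two-sided critical value at the 5% level
t_stat <- b1 / se_b1
t_crit <- qt(1 - 0.05 / 2, df = n - 2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, critical = t_crit, p_value = p_value)

# Decision rule: TRUE means H0 is rejected
abs(t_stat) > t_crit
```

Comparing the p-value with 0.05 leads to the same decision as comparing $|t|$ with the critical value.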

Exercises

Problem 1

Forest scientists are concerned with the decline in forest growth throughout the world. One aspect of this decline is the possible effect of emissions from coal-fired power plants. The scientists in particular are interested in the pH level of the soil and the resulting impact on tree growth retardation. The scientists study various forests which are likely to be exposed to these emissions. They measure various aspects of growth associated with trees in a specified region and the soil pH in the same region. The forest scientists then want to determine the impact on tree growth as the soil becomes more acidic. An index of growth retardation is constructed from the various measurements taken on the trees, with a high value indicating greater retardation in tree growth. A lower value of soil pH indicates a more acidic soil. Twenty tree stands which are exposed to the power plant emissions are selected for study. The values of the growth retardation index and average soil pH are recorded in the table below.

Identify the least-squares estimates of $\beta_{0}$ and $\beta_{1}$ in the model $y=\beta_{0}+\beta_{1}x+\varepsilon$, where y is the index of growth retardation and x is the soil pH.
Predict the growth retardation for a soil pH of 4.0.
Interpret $b_{1}$.
Find the coefficient of determination ($R^{2}$).
Is there a significant linear relationship between y and x?

Problem 2

In baseball, two summaries of a team’s offensive ability are the team batting average (the proportion of times the team’s players get a hit, out of the times they are officially at bat) and team scoring (the team’s mean number of runs scored per game). The data below shows the 2010 statistics for the American League teams, from the AL Team Statistics data file on the text CD. Scoring runs is a result of hitting, so team scoring is the response variable y and team batting average is the explanatory variable x.

Identify the least-squares estimates of $\beta_{0}$ and $\beta_{1}$ in the model $y=\beta_{0}+\beta_{1}x+\varepsilon$.
If a team has a batting average of 0.260 next year, what is their predicted mean number of runs per game?
How do you interpret the slope $b_{1}$ in this context?

Assignments can be submitted via the following link: (Click here)
