STA1512 - Correlation and Regression Analysis
Definitions
Correlation analysis measures the strength of the linear relationship between two variables.
Regression analysis is a statistical tool for evaluating the relationship between one variable and another variable, or between one variable and several other variables.
Correlation
The correlation coefficient ranges from -1 to 1.
A positive sign (+) indicates that the two variables move in the same direction, while a negative sign (-) indicates that they move in opposite directions.
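The two extremes of that range can be illustrated with a minimal sketch. The vectors below are made-up toy values, not part of the course data, chosen so that the relationship is exactly linear in each direction.

```r
# Toy vectors (hypothetical, not from the course data set)
x <- 1:10
y_same <- 2 * x + 1       # rises as x rises
y_opposite <- 5 - 3 * x   # falls as x rises

r_same <- cor(x, y_same)          # exactly linear, increasing
r_opposite <- cor(x, y_opposite)  # exactly linear, decreasing

print(c(same_direction = r_same, opposite_direction = r_opposite))
```

Real data rarely fall exactly on a line, so observed coefficients normally lie strictly between these two limits.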
Example
library(readxl)
library(lmtest)
datacontoh <- read_excel("D:/MATERI KULIAH S2 IPB/SEMESTER 1/ANALISIS STATISTIKA/dataregresi.xlsx")
datacontoh
## # A tibble: 20 x 5
## Subject Triceps_X1 Thigh_X2 Midarm_X3 Bodyfat_Y
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 19.5 43.1 29.1 11.9
## 2 2 24.7 49.8 28.2 22.8
## 3 3 30.7 51.9 37 18.7
## 4 4 29.8 54.3 31.1 20.1
## 5 5 19.1 42.2 30.9 12.9
## 6 6 25.6 53.9 23.7 21.7
## 7 7 31.4 58.5 27.6 27.1
## 8 8 27.9 52.1 30.6 25.4
## 9 9 22.1 49.9 23.2 21.3
## 10 10 25.5 53.5 24.8 19.3
## 11 11 31.1 56.6 30 25.4
## 12 12 30.4 56.7 28.3 27.2
## 13 13 18.7 46.5 23 11.7
## 14 14 19.7 44.2 28.6 17.8
## 15 15 14.6 42.7 21.3 12.8
## 16 16 29.5 54.4 30.1 23.9
## 17 17 27.7 55.3 25.7 22.6
## 18 18 30.2 58.6 24.6 25.4
## 19 19 22.7 48.2 27.1 14.8
## 20 20 25.2 51 27.5 21.1
where
X1: Triceps Skinfold Thickness
X2: Thigh Circumference
X3: Midarm Circumference
Y: Body Fat
The data can be downloaded via the following link: (Click here)
# compute the correlation between Triceps_X1 and Bodyfat_Y
korelasi_y_x1 <- cor(datacontoh$Triceps_X1, datacontoh$Bodyfat_Y, method = "pearson")
korelasi_y_x1
## [1] 0.8432654
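As a cross-check, the same coefficient can be computed from the definition of Pearson's r, the cross-product sum divided by the square root of the product of the two sums of squares. This is a sketch only: the values are typed in from the Triceps_X1 and Bodyfat_Y columns printed above, and the names `triceps`, `bodyfat`, `sxy`, `sxx` and `syy` are introduced here for illustration.

```r
# Values copied from the Triceps_X1 and Bodyfat_Y columns printed above
triceps <- c(19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4, 27.9, 22.1, 25.5,
             31.1, 30.4, 18.7, 19.7, 14.6, 29.5, 27.7, 30.2, 22.7, 25.2)
bodyfat <- c(11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
             25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1)

# Sums of squares and cross-products around the means
sxy <- sum((triceps - mean(triceps)) * (bodyfat - mean(bodyfat)))
sxx <- sum((triceps - mean(triceps))^2)
syy <- sum((bodyfat - mean(bodyfat))^2)

r_manual <- sxy / sqrt(sxx * syy)                        # definition of Pearson's r
r_builtin <- cor(triceps, bodyfat, method = "pearson")   # built-in equivalent

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree with the coefficient reported by `cor()` on the full data set.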
# Pearson correlation matrix (Subject column dropped)
matriks <- cor(datacontoh[,-1])
round(matriks, 2)
## Triceps_X1 Thigh_X2 Midarm_X3 Bodyfat_Y
## Triceps_X1 1.00 0.92 0.46 0.84
## Thigh_X2 0.92 1.00 0.08 0.88
## Midarm_X3 0.46 0.08 1.00 0.14
## Bodyfat_Y 0.84 0.88 0.14 1.00
# draw the correlation plot
library(ggplot2)
library(GGally)
ggpairs(datacontoh[,-1])
Simple Linear Regression Model
Estimated simple linear regression line:
\(\hat{y_{i}}=b_{0}+b_{1}x_{i}\)
where
\(\hat{y_{i}}\) : fitted value of y (the response variable) for the i-th observation
\(x_{i}\) : value of x (the explanatory variable) for the i-th observation
\(b_{0}\) : estimate of the intercept (\(\beta_{0}\))
\(b_{1}\) : estimate of the slope (\(\beta_{1}\))
Interpreting the regression coefficients:
\(b_{0}\) is the estimated mean of y when x equals zero.
\(b_{1}\) is the estimated change in the mean of y for a one-unit increase in x (a positive \(b_{1}\) indicates an increase, a negative \(b_{1}\) a decrease).
Testing the Regression Parameters
t Test (Partial Test)
Hypotheses:
\(H_{0}:\beta_{1}=0\)
\(H_{1}:\beta_{1} \neq 0\)
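Test statistic:
\(t=\dfrac{b_{1}}{s_{b_{1}}}\)
where \(s_{b_{1}}=\sqrt{\dfrac{MSE}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}}\) is the standard error of \(b_{1}\), and \(MSE=\dfrac{SSE}{n-2}\) with SSE the residual sum of squares.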
Rejection region for H0:
Reject H0 if \(|t|>t_{\alpha/2 ; n-2}\)
F Test (Simultaneous Test)
Hypotheses:
\(H_{0}:\beta_{1}=0\)
\(H_{1}:\) at least one \(\beta_{i} \neq 0\) (with a single explanatory variable this reduces to \(\beta_{1} \neq 0\))
Test statistic:
\(F=\dfrac{MS(Regression)}{MS(Residual)}=\dfrac{SS(Regression)/1}{SS(Residual)/(n-2)}\)
where SS(Residual)=SSE
Rejection region for H0:
Reject H0 if \(F>F_{df1;df2;\alpha}\), where df1 = 1 and df2 = n - 2 in simple linear regression.
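The F statistic can be assembled from the regression and residual sums of squares. The sketch below does so for the pharmacy data used in the worked example later in this document; the vector names `x` and `y` and the object names are assumptions made for illustration. It also shows that in simple linear regression F equals the square of the slope's t statistic, so the two tests lead to the same decision.

```r
# Pharmacy data copied from the worked example later in this document
x <- c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)       # Ingredients_X
y <- c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100)   # sales_volume_Y
n <- length(y)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)            # SS(Residual)
ssr <- sum((fitted(fit) - mean(y))^2)   # SS(Regression)

f_stat <- (ssr / 1) / (sse / (n - 2))      # MS(Regression) / MS(Residual)
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)   # upper 5% point of F(1, n - 2)
t_slope <- coef(summary(fit))["x", "t value"]

print(c(F = f_stat, F_critical = f_crit, t_squared = t_slope^2))
```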
Coefficient of Determination (\(R^{2}\))
The coefficient of determination measures the proportion of the variation in the response (dependent) variable that is explained by the explanatory (independent) variables in the regression model.
\(R^{2}\) lies in the range:
\(0\leq R^{2} \leq 1\)
Assumptions
Formal assumptions of regression analysis:
The relation is, in fact, linear, so that the errors all have expected value zero: \(E(\varepsilon_{i})=0\) for all i.
The errors all have the same variance: \(Var(\varepsilon_{i})=\sigma^{2}\) for all i.
The errors are independent of each other.
The errors are all normally distributed; \(\varepsilon_{i}\) is normally distributed for all i.
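These assumptions are usually checked on the residuals of the fitted model. The following is a rough sketch rather than a complete diagnostic workflow: it reuses the pharmacy data from the example later in this document (typed in as the assumed vectors `x` and `y`), and the choice of tests is an assumption, since the text does not name specific ones. The `lmtest` calls run only if that package is installed.

```r
# Pharmacy data copied from the worked example later in this document
x <- c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)       # Ingredients_X
y <- c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100)   # sales_volume_Y

fit <- lm(y ~ x)
res <- residuals(fit)

# Least squares with an intercept forces the residuals to average to zero,
# so this is only a sanity check; linearity is better judged from a plot
# of residuals against fitted values.
print(mean(res))

# Normality of the errors: Shapiro-Wilk test on the residuals (base R)
normality <- shapiro.test(res)
print(normality$p.value)

# Constant variance and independence: optional tests from the lmtest package
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(fit))  # Breusch-Pagan test for non-constant variance
  print(lmtest::dwtest(fit))  # Durbin-Watson test for autocorrelation
}
```

With only 10 observations these tests have little power, so a large p-value does not show that an assumption holds.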
Example
Data from a sample of 10 pharmacies are used to examine the relationship between prescription sales volume and the percentage of prescription ingredients purchased directly from the supplier. The sample data are shown in the table below.
The data can be downloaded via the following link: [Link to download]
Questions:
a. Find the estimated simple linear regression line: \(\hat{y_{i}}=b_{0}+b_{1}x_{i}\)
b. Predict the sales volume of a pharmacy that purchases 15% of its prescription ingredients directly from the supplier.
c. Interpret \(b_{1}\).
d. Find the coefficient of determination (\(R^{2}\)).
e. Is there a significant linear relationship between y (prescription sales volume) and x (percentage of prescription ingredients purchased directly from the supplier)?
Answers:
Using R:
library(readxl)
datareg1 <- read_excel("D:/MATERI KULIAH S2 IPB/ASPRAK 2/REGRESSION.xlsx")
datareg1
## # A tibble: 10 x 2
## sales_volume_Y Ingredients_X
## <dbl> <dbl>
## 1 25 10
## 2 55 18
## 3 50 25
## 4 75 40
## 5 110 50
## 6 138 63
## 7 90 42
## 8 60 30
## 9 10 5
## 10 100 55
model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)
summary(model_reg)
##
## Call:
## lm(formula = sales_volume_Y ~ Ingredients_X, data = datareg1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.074 -4.403 -1.607 5.719 14.834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6979 5.9520 0.789 0.453
## Ingredients_X 1.9705 0.1545 12.750 1.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.022 on 8 degrees of freedom
## Multiple R-squared: 0.9531, Adjusted R-squared: 0.9472
## F-statistic: 162.6 on 1 and 8 DF, p-value: 1.349e-06
a. Estimated simple linear regression line
\(\hat{y}=4.6979+1.9705x\)
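The coefficients reported by `lm()` can be reproduced from the usual least-squares formulas, the slope as the cross-product sum over the sum of squares of x and the intercept as the mean of y minus the slope times the mean of x. This sketch types the data in from the printed table; `x`, `y`, `sxx` and `sxy` are names introduced here for illustration.

```r
# Pharmacy data copied from the table printed above
x <- c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)       # Ingredients_X
y <- c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100)   # sales_volume_Y

sxx <- sum((x - mean(x))^2)               # sum of squares of x
sxy <- sum((x - mean(x)) * (y - mean(y))) # sum of cross-products

b1 <- sxy / sxx                # slope estimate
b0 <- mean(y) - b1 * mean(x)   # intercept estimate

print(round(c(b0 = b0, b1 = b1), 4))
```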
b. Prediction at x=15%
\(\hat{y}=4.70+1.97(15)=34.25\) ($34,250).
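The same prediction can be obtained with `predict()`, which uses the unrounded coefficients and therefore returns roughly 34.255 rather than the hand-rounded 34.25. In this sketch the data frame is rebuilt from the printed table so that the block runs on its own; `new_pharmacy` is a name introduced here for illustration.

```r
# Rebuild the data frame from the table printed above
datareg1 <- data.frame(
  sales_volume_Y = c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100),
  Ingredients_X  = c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)
)
model_reg <- lm(sales_volume_Y ~ Ingredients_X, data = datareg1)

# A hypothetical pharmacy buying 15% of its ingredients from the supplier
new_pharmacy <- data.frame(Ingredients_X = 15)
y_hat <- predict(model_reg, newdata = new_pharmacy)

print(y_hat)
```

Since x = 15 lies inside the observed range of 5 to 63, this is an interpolation rather than an extrapolation.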
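c. Interpretation
For each 1 percentage point increase in the share of prescription ingredients purchased directly from the supplier (x), the predicted prescription sales volume (y) of a pharmacy increases by 1.97 units, or about $1,970 given that part b reports sales volume in thousands of dollars.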
d. Coefficient of determination
\(R^{2}=0.9531\): 95.31% of the variation in prescription sales volume (y) is explained by the percentage of prescription ingredients purchased directly from the supplier (x).
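This value can be recovered in more than one way. The sketch below, which again types the data in from the printed table under the assumed names `x` and `y`, computes it as one minus the ratio of the residual to the total sum of squares, and also as the squared Pearson correlation, which coincides with it when there is a single explanatory variable.

```r
# Pharmacy data copied from the table printed above
x <- c(10, 18, 25, 40, 50, 63, 42, 30, 5, 55)       # Ingredients_X
y <- c(25, 55, 50, 75, 110, 138, 90, 60, 10, 100)   # sales_volume_Y

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # variation left unexplained by the line
sst <- sum((y - mean(y))^2)    # total variation in y

r2_from_ss  <- 1 - sse / sst
r2_from_cor <- cor(x, y)^2             # equal to R^2 in simple regression
r2_from_lm  <- summary(fit)$r.squared  # value reported by summary()

print(c(from_ss = r2_from_ss, from_cor = r2_from_cor, from_lm = r2_from_lm))
```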
e. t test for \(\beta_{1}\)
I. Hypotheses:
\(H_{0}:\beta_{1}=0\) (there is no significant linear relationship between X and Y)
\(H_{1}:\beta_{1} \neq 0\) (there is a significant linear relationship between X and Y)
II. Significance level: 5%
III. Test statistic
summary(model_reg)
##
## Call:
## lm(formula = sales_volume_Y ~ Ingredients_X, data = datareg1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.074 -4.403 -1.607 5.719 14.834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6979 5.9520 0.789 0.453
## Ingredients_X 1.9705 0.1545 12.750 1.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.022 on 8 degrees of freedom
## Multiple R-squared: 0.9531, Adjusted R-squared: 0.9472
## F-statistic: 162.6 on 1 and 8 DF, p-value: 1.349e-06
IV. Critical region:
Reject H0 if \(|t|>t_{0.025; 8}=2.306\)
V. Conclusion
Because \(t=12.750 >t_{0.025; 8}=2.306\), H0 is rejected; at the 5% significance level there is a significant linear relationship between X and Y.
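The critical value and the p-value behind this decision can be checked directly with the t distribution functions in base R. This is a small sketch; the t statistic and the degrees of freedom are copied from the summary output above, and the variable names are introduced here for illustration.

```r
t_stat   <- 12.750   # t value for Ingredients_X in the summary output
df_resid <- 8        # residual degrees of freedom, n - 2 = 10 - 2
alpha    <- 0.05

t_crit  <- qt(1 - alpha / 2, df_resid)                       # two-sided critical value
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE) # two-sided p-value

reject_h0 <- abs(t_stat) > t_crit
print(c(t_critical = t_crit, p_value = p_value))
print(reject_h0)
```

Comparing the p-value with 0.05 gives the same decision as comparing |t| with the critical value.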
Exercise
Problem 1
Forest scientists are concerned with the decline in forest growth throughout the world. One aspect of this decline is the possible effect of emissions from coal-fired power plants. The scientists in particular are interested in the pH level of the soil and the resulting impact on tree growth retardation. The scientists study various forests which are likely to be exposed to these emissions. They measure various aspects of growth associated with trees in a specified region and the soil pH in the same region. The forest scientists then want to determine impact on tree growth as the soil becomes more acidic. An index of growth retardation is constructed from the various measurements taken on the trees with a high value indicating greater retardation in tree growth. A higher value of soil pH indicates a more acidic soil. Twenty tree stands which are exposed to the power plant emissions are selected for study. The values of the growth retardation index and average soil pH are recorded in the table below.
Identify least-squares estimates for \(\beta_{0}\) and \(\beta_{1}\) in the model \(Y=\beta_{0}+\beta_{1} X\) where y is the index of growth retardation and x is the soil pH.
Predict the growth retardation for a soil pH of 4.0.
Interpret \(b_{1}\).
Find the coefficient of determination (\(R^{2}\)).
Is there a significant linear relationship between y and x?
Problem 2
In baseball, two summaries of a team’s offensive ability are the team batting average (the proportion of times the team’s players get a hit, out of the times they are officially at bat) and team scoring (the team’s mean number of runs scored per game). The data below shows the 2010 statistics for the American League teams, from the AL Team Statistics data file on the text CD. Scoring runs is a result of hitting, so team scoring is the response variable y and team batting average is the explanatory variable x.
Identify least-squares estimates for \(\beta_{0}\) and \(\beta_{1}\) in the model \(Y=\beta_{0}+\beta_{1} X\).
If a team has a batting average of 0.260 next year, what is their predicted mean number of runs per game?
How do you interpret the slope \(b_{1}\) in this context?
Assignments can be submitted via the following link: (Click here)