Module: 208251 Regression Analysis and Non-Parametric
Statistics
Instructor: Wisunee Puggard
Affiliation: Department of Statistics, Faculty of Science,
Chiang Mai University.
Students are able to use R language to
perform descriptive statistics.
construct scatterplot between two quantitative variables.
perform correlation analysis.
perform linear regression analysis and inference on regression
parameters.
interpret the results.
You can download data file in Class material on MANGO Canvas
Data file is in excel file .xls, we need to change it to be .csv
before importing it into R.
Import data into R
Note that your working directory (the place where the data
file is at) will be different from mine.
Obs Area Sale.price 1 1 13 115 2 2 4 45 3 3 12 100 4 4 5 50 5 5 6 55
6 6 8 85 7 7 3 40 8 8 4 50 9 9 5 45 10 10 7 70
Mydata=read.csv('/Users/wisuneepuggard/Desktop/ICAS2024/Chapter2_ExerciseI.csv'
,header=TRUE)
Mydata #call to see data and variables
## Obs Area Sale.price
## 1 1 13 115
## 2 2 4 45
## 3 3 12 100
## 4 4 5 50
## 5 5 6 55
## 6 6 8 85
## 7 7 3 40
## 8 8 4 50
## 9 9 5 45
## 10 10 7 70
x = Mydata$Area #set x (independent variable)
y = Mydata$Sale.price #set y (dependent variable)
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
cor(x,y)
## [1] 0.978902
cor.test(x, y,alternative = c("two.sided"),conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 13.55, df = 8, p-value = 8.451e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9103840 0.9951655
## sample estimates:
## cor
## 0.978902
fit = lm(y~x) #model fitting
summary(fit) #some basic information about the model is output
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.738 -4.618 0.987 2.269 9.741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.202 4.120 3.69 0.00613 **
## x 7.507 0.554 13.55 8.45e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.653 on 8 degrees of freedom
## Multiple R-squared: 0.9582, Adjusted R-squared: 0.953
## F-statistic: 183.6 on 1 and 8 DF, p-value: 8.451e-07
confint.lm(fit,level=0.95) #to obtain a confidence interval for the coefficient parameters,
## 2.5 % 97.5 %
## (Intercept) 5.701166 24.702293
## x 6.229628 8.784781
yhat = predict(fit) #compute fitted y
yhat
## 1 2 3 4 5 6 7 8
## 112.79539 45.23055 105.28818 52.73775 60.24496 75.25937 37.72334 45.23055
## 9 10
## 52.73775 67.75216
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
lines(x,yhat,type="o",pch=20)
res = resid(fit)
res
## 1 2 3 4 5 6 7
## 2.2046110 -0.2305476 -5.2881844 -2.7377522 -5.2449568 9.7406340 2.2766571
## 8 9 10
## 4.7694524 -7.7377522 2.2478386
par(mfrow=c(1,2)) #set plot layout as 1 row 2 column
plot(res)
hist(res)
with confidence interval and prediction interval at a value of
x=13:This returns three values
fit:the point or fitted estimate
lwr: the lower bound of the interval
upr: the upper bound of the interval
new = data.frame(x = 13)
pred = predict(fit, new)
pred
## 1
## 112.7954
pred.CI = predict(fit, newdata=new, interval = "confidence")
pred.CI
## fit lwr upr
## 1 112.7954 103.7525 121.8382
pred.PI = predict(fit, newdata=new, interval = "prediction")
pred.PI
## fit lwr upr
## 1 112.7954 96.93079 128.66
Fit the regression line and confidence interval and prediction interval for all values of x:
new = data.frame(x = sort(x))
new
## x
## 1 3
## 2 4
## 3 4
## 4 5
## 5 5
## 6 6
## 7 7
## 8 8
## 9 12
## 10 13
pred.CI = predict(fit, newdata=new, interval = "confidence")
pred.CI
## fit lwr upr
## 1 37.72334 31.45150 43.99519
## 2 45.23055 39.85561 50.60549
## 3 45.23055 39.85561 50.60549
## 4 52.73775 48.07854 57.39697
## 5 52.73775 48.07854 57.39697
## 6 60.24496 56.02702 64.46289
## 7 67.75216 63.61234 71.89198
## 8 75.25937 70.81531 79.70342
## 9 105.28818 97.36103 113.21534
## 10 112.79539 103.75253 121.83825
pred.PI = predict(fit, newdata=new, interval = "prediction")
pred.PI
## fit lwr upr
## 1 37.72334 23.25793 52.18875
## 2 45.23055 31.13083 59.33027
## 3 45.23055 31.13083 59.33027
## 4 52.73775 38.89505 66.58045
## 5 52.73775 38.89505 66.58045
## 6 60.24496 46.54448 73.94543
## 7 67.75216 54.07553 81.42879
## 8 75.25937 61.48759 89.03114
## 9 105.28818 90.03198 120.54439
## 10 112.79539 96.93079 128.65998
# create a plot
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
lines(x,yhat,type="o",pch=20)
# 95% confidence interval
points(new$x,pred.CI[,"lwr"],col="red",type="o",pch=20)
points(new$x,pred.CI[,"upr"],col="red",type="o",pch=20)
# 95% prediction interval
points(new$x,pred.PI[,"lwr"],col="blue",type="o",pch=20)
points(new$x,pred.PI[,"upr"],col="blue",type="o",pch=20)
Note that your working directory (the place where the data file is at) will be different from mine.
Mydata2 <- read.csv('/Users/wisuneepuggard/Desktop/ICAS2024/JobSatisfaction.csv'
,header=TRUE)
Mydata2 #To view dataset
## Obs Age Gender Status Education Smoke Income Life.Satisfaction
## 1 1 33 Male Married Undergrad No 15050 3
## 2 2 27 Male Married Undergrad Yes 20600 2
## 3 3 22 Female Single Master Yes 15200 4
## 4 4 34 Male Married Undergrad No 22500 3
## 5 5 24 Female Single High school No 30400 4
## 6 6 37 Female Divorced Undergrad No 33850 3
## 7 7 25 Female Single Master Yes 15100 2
## 8 8 31 Male Married Undergrad Yes 34900 1
## 9 9 40 Male Married Undergrad No 23200 1
## 10 10 26 Male Single Secondary school Yes 32000 1
## 11 11 41 Female Married High school Yes 18650 2
## 12 12 38 Male Married High school No 28570 2
## 13 13 28 Female Single Undergrad No 26000 2
## 14 14 44 Male Divorced High school Yes 39560 3
## 15 15 21 Female Married Undergrad Yes 18000 4
## 16 16 25 Male Single Undergrad No 13500 2
## 17 17 30 Female Single Secondary school No 22350 1
## 18 18 42 Male Married Undergrad Yes 35000 3
## 19 19 43 Female Married Undergrad No 28000 4
## 20 20 20 Male Single Master No 15000 3
## Job.Satisfaction
## 1 3
## 2 4
## 3 1
## 4 4
## 5 3
## 6 3
## 7 3
## 8 2
## 9 3
## 10 3
## 11 1
## 12 3
## 13 4
## 14 2
## 15 3
## 16 1
## 17 1
## 18 3
## 19 3
## 20 4
str(Mydata2) #To view type of data
## 'data.frame': 20 obs. of 9 variables:
## $ Obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 33 27 22 34 24 37 25 31 40 26 ...
## $ Gender : chr "Male" "Male" "Female" "Male" ...
## $ Status : chr "Married" "Married" "Single" "Married" ...
## $ Education : chr "Undergrad" "Undergrad" "Master " "Undergrad" ...
## $ Smoke : chr "No" "Yes" "Yes" "No" ...
## $ Income : int 15050 20600 15200 22500 30400 33850 15100 34900 23200 32000 ...
## $ Life.Satisfaction: int 3 2 4 3 4 3 2 1 1 1 ...
## $ Job.Satisfaction : int 3 4 1 4 3 3 3 2 3 3 ...
# frequency table of gender
table1 <- table(Mydata2$Gender)
table1
##
## Female Male
## 9 11
prop.table(table1) #compute percentage
##
## Female Male
## 0.45 0.55
# percentages for gender by highest education
table2 <- table(Mydata2$Education, Mydata2$Gender)
table2
##
## Female Male
## High school 2 2
## Master 2 1
## Secondary school 1 1
## Undergrad 4 7
prop.table(table2)
##
## Female Male
## High school 0.10 0.10
## Master 0.10 0.05
## Secondary school 0.05 0.05
## Undergrad 0.20 0.35
# Descriptive statistics for quantitative data
summary(Mydata2$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 25.00 30.50 31.55 38.50 44.00
summary(Mydata2$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13500 17300 22850 24372 30800 39560
boxplot(Mydata2$Income)
# Descriptive between qualitative and quantitative
boxplot(Mydata2$Income~Mydata2$Gender,xlab="Gender",ylab="Income")
boxplot(Mydata2$Income~Mydata2$Job.Satisfaction,xlab="Job Satisfaction",ylab="Income")
x = Mydata2$Age #set x (independent variable)
y = Mydata2$Income #set y (dependent variable)
plot(x,y,xlab="Age",ylab="Income",main="Scatterplot of age and income")
Compute the correlation coefficient and hypothesis test for
significant linear correlation:
cor(x,y)
## [1] 0.5424278
cor.test(x, y,alternative = c("two.sided"),conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 2.7393, df = 18, p-value = 0.01348
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1314628 0.7942906
## sample estimates:
## cor
## 0.5424278
fit = lm(y~x) #model fitting
summary(fit) #some basic information about the model is output
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10957 -4321 -1207 5244 10833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6893.1 6567.0 1.050 0.3078
## x 554.0 202.2 2.739 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6950 on 18 degrees of freedom
## Multiple R-squared: 0.2942, Adjusted R-squared: 0.255
## F-statistic: 7.504 on 1 and 18 DF, p-value: 0.01348
confint.lm(fit) #to obtain a confidence interval for the coefficient estimates,
## 2.5 % 97.5 %
## (Intercept) -6903.6540 20689.9379
## x 129.1102 978.8681
yhat = predict(fit) #compute fitted y
yhat
## 1 2 3 4 5 6 7 8
## 25174.78 21850.85 19080.90 25728.77 20188.88 27390.74 20742.87 24066.81
## 9 10 11 12 13 14 15 16
## 29052.71 21296.86 29606.70 27944.73 22404.84 31268.67 18526.91 20742.87
## 17 18 19 20
## 23512.82 30160.69 30714.68 17972.93
plot(x,y,xlab="Age",ylab="Income",main="Linear fitted line")
lines(x,yhat,type="o",pch=20)
res = resid(fit)
res
## 1 2 3 4 5 6
## -10124.7843 -1250.8493 -3880.9035 -3228.7734 10211.1182 6459.2591
## 7 8 9 10 11 12
## -5642.8710 10833.1940 -5852.7084 10703.1398 -10956.6976 625.2699
## 13 14 15 16 17 18
## 3595.1615 8291.3349 -526.9143 -7242.8710 -1162.8168 4839.3133
## 19 20
## -2714.6759 -2972.9252
par(mfrow=c(1,2)) #set plot layout as 1 row 2 column
plot(res)
hist(res)
Prediction with confidence interval and prediction interval for all values of x:
new2 = data.frame(x = sort(x))
new2
## x
## 1 20
## 2 21
## 3 22
## 4 24
## 5 25
## 6 25
## 7 26
## 8 27
## 9 28
## 10 30
## 11 31
## 12 33
## 13 34
## 14 37
## 15 38
## 16 40
## 17 41
## 18 42
## 19 43
## 20 44
pred.CI = predict(fit, newdata=new2, interval = "confidence",se.fit=TRUE)
pred.CI
## $fit
## fit lwr upr
## 1 17972.93 12078.73 23867.12
## 2 18526.91 12981.47 24072.36
## 3 19080.90 13872.89 24288.92
## 4 20188.88 15611.81 24765.95
## 5 20742.87 16452.86 25032.89
## 6 20742.87 16452.86 25032.89
## 7 21296.86 17269.46 25324.26
## 8 21850.85 18056.56 25645.14
## 9 22404.84 18808.39 26001.28
## 10 23512.82 20182.19 26843.45
## 11 24066.81 20793.58 27340.03
## 12 25174.78 21852.30 28497.27
## 13 25728.77 22301.97 29155.57
## 14 27390.74 23388.07 31393.41
## 15 27944.73 23682.15 32207.31
## 16 29052.71 24199.97 33905.45
## 17 29606.70 24431.71 34781.68
## 18 30160.69 24649.53 35671.84
## 19 30714.68 24855.81 36573.54
## 20 31268.67 25052.50 37484.83
##
## $se.fit
## 1 2 3 4 5 6 7 8
## 2805.526 2639.529 2478.920 2178.602 2041.968 2041.968 1916.966 1806.012
## 9 10 11 12 13 14 15 16
## 1711.841 1585.317 1557.994 1581.443 1631.094 1905.195 2028.907 2309.816
## 17 18 19 20
## 2463.197 2623.209 2788.711 2958.782
##
## $df
## [1] 18
##
## $residual.scale
## [1] 6949.78
pred.PI = predict(fit, newdata=new2, interval = "prediction",se.fit=TRUE)
pred.PI
## $fit
## fit lwr upr
## 1 17972.93 2227.160 33718.69
## 2 18526.91 2908.349 34145.48
## 3 19080.90 3578.933 34582.87
## 4 20188.88 4887.339 35490.43
## 5 20742.87 5524.728 35961.01
## 6 20742.87 5524.728 35961.01
## 7 21296.86 6150.654 36443.07
## 8 21850.85 6764.953 36936.75
## 9 22404.84 7367.484 37442.19
## 10 23512.82 8536.812 38488.82
## 11 24066.81 9103.463 39030.15
## 12 25174.78 10200.588 40148.98
## 13 25728.77 10731.087 40726.46
## 14 27390.74 12251.092 42530.39
## 15 27944.73 12734.300 43155.16
## 16 29052.71 13666.457 44438.96
## 17 29606.70 14115.794 45097.60
## 18 30160.69 14554.262 45767.11
## 19 30714.68 14982.101 46447.25
## 20 31268.67 15399.563 47137.77
##
## $se.fit
## 1 2 3 4 5 6 7 8
## 2805.526 2639.529 2478.920 2178.602 2041.968 2041.968 1916.966 1806.012
## 9 10 11 12 13 14 15 16
## 1711.841 1585.317 1557.994 1581.443 1631.094 1905.195 2028.907 2309.816
## 17 18 19 20
## 2463.197 2623.209 2788.711 2958.782
##
## $df
## [1] 18
##
## $residual.scale
## [1] 6949.78
# create a plot
plot(x,y,xlab="Age",ylab="Income",main="Linear fitted line with confidence and prediction interval",
ylim=c(5000,55000))
lines(x,yhat,type="o",pch=20)
# 95% confidence interval
points(new2$x,pred.CI$fit[,"lwr"],col="red",type="o",pch=20)
points(new2$x,pred.CI$fit[,"upr"],col="red",type="o",pch=20)
# 95% prediction interval
points(new2$x,pred.PI$fit[,"lwr"],col="blue",type="o",pch=20)
points(new2$x,pred.PI$fit[,"upr"],col="blue",type="o",pch=20)
You must submit:
R file with your codes, and
Answer sheet with your handwriting
On Mango canvas, see the deadline there!
A teacher from Statistics class wants to examine the relationship between final and midterm score of her students. The midterm and final scores of 20 randomly chosen students are shown below
| Student | Midterm | Final |
|---|---|---|
| 1 | 77 | 82 |
| 2 | 50 | 66 |
| 3 | 71 | 78 |
| 4 | 72 | 43 |
| 5 | 80 | 56 |
| 6 | 93 | 85 |
| 7 | 95 | 99 |
| 8 | 98 | 98 |
| 9 | 66 | 68 |
| 10 | 55 | 45 |
| 11 | 63 | 75 |
| 12 | 62 | 45 |
| 13 | 51 | 71 |
| 14 | 65 | 60 |
| 15 | 90 | 87 |
| 16 | 74 | 68 |
| 17 | 48 | 56 |
| 18 | 67 | 85 |
| 19 | 67 | 60 |
| 20 | 85 | 90 |
Use R language to:
Find sample correlation between midterm and final scores and
interpret its value.
Test the population correlation whether midterm and final scores are correlated at significance level 0.05 (Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion).
Find the 95% confidence interval of the population correlation.
Make a scatterplot between midterm and final scores. From the scatterplot, describe the form, direction, and strength of the relationship. Is the relationship approximately linear?
Obtain the least-squares regression line for predicting final score from midterm score.
Find the value of MSres
From the regression line, test whether midterm score can be used to described final score at significance level 0.05. (Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion)
Find the 95% confidence interval of regression parameters
Find 𝑟2 and interpret its value for these data
Find 95% confidence and prediction interval of the final score for the midterm score of 75.
Plot linear fitted line and their 95% confidence interval and prediction interval for all values of x.
Write a short paragraph to summarize the results.