Module: 208251 Regression Analysis and Non-Parametric Statistics
Instructor: Parichart Pattarapanitchai
Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.

Objectives

Students are able to use R language to

  1. perform descriptive statistics

  2. construct scatterplot between two quantitative variables

  3. perform correlation analysis

  4. perform linear regression analysis and inference on regression parameters

  5. interpret the results

Exercise I:

The data file is on github. You must install RCurl package if you have not ever done yet in your computer. Once it is installed, load RCurl package to get data from repository.

1. Import data into R

# install.packages("RCurl") # install 'RCurl' package if not done yet
library(RCurl)  # load 'RCurl' package
Mydata <- read.csv(text=getURL("https://raw.githubusercontent.com/Paripai/208251/main/AreaAndSalePrice.csv"))
Mydata    #call to see data and variables
##    Obs Area Sale.price
## 1    1   13        115
## 2    2    4         45
## 3    3   12        100
## 4    4    5         50
## 5    5    6         55
## 6    6    8         85
## 7    7    3         40
## 8    8    4         50
## 9    9    5         45
## 10  10    7         70

2. Create scatterplot

x = Mydata$Area  #set x (independent variable)
y = Mydata$Sale.price #set y (dependent variable)

plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")

3. Compute the correlation coefficient

Compute the correlation coefficient and hypothesis test for significant linear correlation:

cor(x,y)  
## [1] 0.978902
cor.test(x, y,alternative = c("two.sided"),conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 13.55, df = 8, p-value = 8.451e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9103840 0.9951655
## sample estimates:
##      cor 
## 0.978902

4. Fit regression line

fit = lm(y~x) #model fitting
summary(fit) #some basic information about the model is output
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.738 -4.618  0.987  2.269  9.741 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.202      4.120    3.69  0.00613 ** 
## x              7.507      0.554   13.55 8.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.653 on 8 degrees of freedom
## Multiple R-squared:  0.9582, Adjusted R-squared:  0.953 
## F-statistic: 183.6 on 1 and 8 DF,  p-value: 8.451e-07
confint.lm(fit,level=0.95) #to obtain a confidence interval for the coefficient parameters,
##                2.5 %    97.5 %
## (Intercept) 5.701166 24.702293
## x           6.229628  8.784781

5. Plot the fitted line

yhat = predict(fit) #compute fitted y
yhat 
##         1         2         3         4         5         6         7         8 
## 112.79539  45.23055 105.28818  52.73775  60.24496  75.25937  37.72334  45.23055 
##         9        10 
##  52.73775  67.75216
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
lines(x,yhat,type="o",pch=20)

6. Compute the residuals

res = resid(fit)
res
##          1          2          3          4          5          6          7 
##  2.2046110 -0.2305476 -5.2881844 -2.7377522 -5.2449568  9.7406340  2.2766571 
##          8          9         10 
##  4.7694524 -7.7377522  2.2478386
par(mfrow=c(1,2))  #set plot layout as 1 row 2 column
plot(res)
hist(res)

7. Prediction

Prediction with confidence interval and prediction interval at a value of x=13: This returns three values

fit:the point or fitted estimate
lwr: the lower bound of the interval
upr: the upper bound of the interval

new = data.frame(x = 13)
pred = predict(fit, new)
pred
##        1 
## 112.7954
pred.CI = predict(fit, newdata=new, interval = "confidence")
pred.CI 
##        fit      lwr      upr
## 1 112.7954 103.7525 121.8382
pred.PI = predict(fit, newdata=new, interval = "prediction")
pred.PI 
##        fit      lwr    upr
## 1 112.7954 96.93079 128.66

8. Fit the regression line

Fit the regression line and confidence interval and prediction interval for all values of x:

new = data.frame(x = sort(x))
new
##     x
## 1   3
## 2   4
## 3   4
## 4   5
## 5   5
## 6   6
## 7   7
## 8   8
## 9  12
## 10 13
pred.CI = predict(fit, newdata=new, interval = "confidence")
pred.CI
##          fit       lwr       upr
## 1   37.72334  31.45150  43.99519
## 2   45.23055  39.85561  50.60549
## 3   45.23055  39.85561  50.60549
## 4   52.73775  48.07854  57.39697
## 5   52.73775  48.07854  57.39697
## 6   60.24496  56.02702  64.46289
## 7   67.75216  63.61234  71.89198
## 8   75.25937  70.81531  79.70342
## 9  105.28818  97.36103 113.21534
## 10 112.79539 103.75253 121.83825
pred.PI = predict(fit, newdata=new, interval = "prediction")
pred.PI
##          fit      lwr       upr
## 1   37.72334 23.25793  52.18875
## 2   45.23055 31.13083  59.33027
## 3   45.23055 31.13083  59.33027
## 4   52.73775 38.89505  66.58045
## 5   52.73775 38.89505  66.58045
## 6   60.24496 46.54448  73.94543
## 7   67.75216 54.07553  81.42879
## 8   75.25937 61.48759  89.03114
## 9  105.28818 90.03198 120.54439
## 10 112.79539 96.93079 128.65998
# create a plot
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
lines(x,yhat,type="o",pch=20)
# 95% confidence interval
points(new$x,pred.CI[,"lwr"],col="red",type="o",pch=20) 
points(new$x,pred.CI[,"upr"],col="red",type="o",pch=20)
# 95% prediction interval
points(new$x,pred.PI[,"lwr"],col="blue",type="o",pch=20) 
points(new$x,pred.PI[,"upr"],col="blue",type="o",pch=20)

Exercise II

Job and Life Satisfaction

1. Import data into R

Mydata2 <- read.csv(text=getURL("https://raw.githubusercontent.com/Paripai/208251/main/JobSatisfaction.csv"))

2. Data exploratory

head(Mydata2)    #To view dataset
##   Obs Age Gender   Status   Education Smoke Income Life.Satisfaction
## 1   1  33   Male  Married   Undergrad    No  15050                 3
## 2   2  27   Male  Married   Undergrad   Yes  20600                 2
## 3   3  22 Female   Single     Master    Yes  15200                 4
## 4   4  34   Male  Married   Undergrad    No  22500                 3
## 5   5  24 Female   Single High school    No  30400                 4
## 6   6  37 Female Divorced   Undergrad    No  33850                 3
##   Job.Satisfaction
## 1                3
## 2                4
## 3                1
## 4                4
## 5                3
## 6                3
str(Mydata2) #To view type of data
## 'data.frame':    20 obs. of  9 variables:
##  $ Obs              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age              : int  33 27 22 34 24 37 25 31 40 26 ...
##  $ Gender           : chr  "Male" "Male" "Female" "Male" ...
##  $ Status           : chr  "Married" "Married" "Single" "Married" ...
##  $ Education        : chr  "Undergrad" "Undergrad" "Master " "Undergrad" ...
##  $ Smoke            : chr  "No" "Yes" "Yes" "No" ...
##  $ Income           : int  15050 20600 15200 22500 30400 33850 15100 34900 23200 32000 ...
##  $ Life.Satisfaction: int  3 2 4 3 4 3 2 1 1 1 ...
##  $ Job.Satisfaction : int  3 4 1 4 3 3 3 2 3 3 ...
# frequency table of gender
table1 <- table(Mydata2$Gender)
table1
## 
## Female   Male 
##      9     11
prop.table(table1) #compute percentage
## 
## Female   Male 
##   0.45   0.55
# percentages for gender by highest education
table2 <- table(Mydata2$Education, Mydata2$Gender)
table2
##                   
##                    Female Male
##   High school           2    2
##   Master                2    1
##   Secondary school      1    1
##   Undergrad             4    7
prop.table(table2)
##                   
##                    Female Male
##   High school        0.10 0.10
##   Master             0.10 0.05
##   Secondary school   0.05 0.05
##   Undergrad          0.20 0.35
# Descriptive statistics for quantitative data
summary(Mydata2$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   25.00   30.50   31.55   38.50   44.00
summary(Mydata2$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13500   17300   22850   24372   30800   39560
boxplot(Mydata2$Income)

# Descriptive between qualitative and quantitative
boxplot(Mydata2$Income~Mydata2$Gender,xlab="Gender",ylab="Income")

boxplot(Mydata2$Income~Mydata2$Job.Satisfaction,xlab="Job Satisfaction",ylab="Income")

3. Create scatterplot

x = Mydata2$Age      #set x (independent variable)
y = Mydata2$Income   #set y (dependent variable)

plot(x,y,xlab="Age",ylab="Income",main="Scatterplot of age and income")

4. Compute the correlation coefficient

Compute the correlation coefficient and hypothesis test for significant linear correlation:

cor(x,y)  
## [1] 0.5424278
cor.test(x, y,alternative = c("two.sided"),conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 2.7393, df = 18, p-value = 0.01348
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1314628 0.7942906
## sample estimates:
##       cor 
## 0.5424278

5. Fit regression line

fit = lm(y~x) #model fitting
summary(fit) #some basic information about the model is output
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10957  -4321  -1207   5244  10833 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   6893.1     6567.0   1.050   0.3078  
## x              554.0      202.2   2.739   0.0135 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6950 on 18 degrees of freedom
## Multiple R-squared:  0.2942, Adjusted R-squared:  0.255 
## F-statistic: 7.504 on 1 and 18 DF,  p-value: 0.01348
confint.lm(fit) #to obtain a confidence interval for the coefficient estimates,
##                  2.5 %     97.5 %
## (Intercept) -6903.6540 20689.9379
## x             129.1102   978.8681

6. Plot the fitted line

yhat = predict(fit) #compute fitted y
yhat
##        1        2        3        4        5        6        7        8 
## 25174.78 21850.85 19080.90 25728.77 20188.88 27390.74 20742.87 24066.81 
##        9       10       11       12       13       14       15       16 
## 29052.71 21296.86 29606.70 27944.73 22404.84 31268.67 18526.91 20742.87 
##       17       18       19       20 
## 23512.82 30160.69 30714.68 17972.93
plot(x,y,xlab="Age",ylab="Income",main="Linear fitted line")
lines(x,yhat,type="o",pch=20)

7. Compute the residuals

res = resid(fit)
res
##           1           2           3           4           5           6 
## -10124.7843  -1250.8493  -3880.9035  -3228.7734  10211.1182   6459.2591 
##           7           8           9          10          11          12 
##  -5642.8710  10833.1940  -5852.7084  10703.1398 -10956.6976    625.2699 
##          13          14          15          16          17          18 
##   3595.1615   8291.3349   -526.9143  -7242.8710  -1162.8168   4839.3133 
##          19          20 
##  -2714.6759  -2972.9252
par(mfrow=c(1,2))  #set plot layout as 1 row 2 column
plot(res)
hist(res)

8. Prediction

Prediction with confidence interval and prediction interval for all values of x:

new2 = data.frame(x = sort(x))
new2
##     x
## 1  20
## 2  21
## 3  22
## 4  24
## 5  25
## 6  25
## 7  26
## 8  27
## 9  28
## 10 30
## 11 31
## 12 33
## 13 34
## 14 37
## 15 38
## 16 40
## 17 41
## 18 42
## 19 43
## 20 44
pred.CI = predict(fit, newdata=new2, interval = "confidence",se.fit=TRUE)
pred.CI
## $fit
##         fit      lwr      upr
## 1  17972.93 12078.73 23867.12
## 2  18526.91 12981.47 24072.36
## 3  19080.90 13872.89 24288.92
## 4  20188.88 15611.81 24765.95
## 5  20742.87 16452.86 25032.89
## 6  20742.87 16452.86 25032.89
## 7  21296.86 17269.46 25324.26
## 8  21850.85 18056.56 25645.14
## 9  22404.84 18808.39 26001.28
## 10 23512.82 20182.19 26843.45
## 11 24066.81 20793.58 27340.03
## 12 25174.78 21852.30 28497.27
## 13 25728.77 22301.97 29155.57
## 14 27390.74 23388.07 31393.41
## 15 27944.73 23682.15 32207.31
## 16 29052.71 24199.97 33905.45
## 17 29606.70 24431.71 34781.68
## 18 30160.69 24649.53 35671.84
## 19 30714.68 24855.81 36573.54
## 20 31268.67 25052.50 37484.83
## 
## $se.fit
##        1        2        3        4        5        6        7        8 
## 2805.526 2639.529 2478.920 2178.602 2041.968 2041.968 1916.966 1806.012 
##        9       10       11       12       13       14       15       16 
## 1711.841 1585.317 1557.994 1581.443 1631.094 1905.195 2028.907 2309.816 
##       17       18       19       20 
## 2463.197 2623.209 2788.711 2958.782 
## 
## $df
## [1] 18
## 
## $residual.scale
## [1] 6949.78
pred.PI = predict(fit, newdata=new2, interval = "prediction",se.fit=TRUE)
pred.PI
## $fit
##         fit       lwr      upr
## 1  17972.93  2227.160 33718.69
## 2  18526.91  2908.349 34145.48
## 3  19080.90  3578.933 34582.87
## 4  20188.88  4887.339 35490.43
## 5  20742.87  5524.728 35961.01
## 6  20742.87  5524.728 35961.01
## 7  21296.86  6150.654 36443.07
## 8  21850.85  6764.953 36936.75
## 9  22404.84  7367.484 37442.19
## 10 23512.82  8536.812 38488.82
## 11 24066.81  9103.463 39030.15
## 12 25174.78 10200.588 40148.98
## 13 25728.77 10731.087 40726.46
## 14 27390.74 12251.092 42530.39
## 15 27944.73 12734.300 43155.16
## 16 29052.71 13666.457 44438.96
## 17 29606.70 14115.794 45097.60
## 18 30160.69 14554.262 45767.11
## 19 30714.68 14982.101 46447.25
## 20 31268.67 15399.563 47137.77
## 
## $se.fit
##        1        2        3        4        5        6        7        8 
## 2805.526 2639.529 2478.920 2178.602 2041.968 2041.968 1916.966 1806.012 
##        9       10       11       12       13       14       15       16 
## 1711.841 1585.317 1557.994 1581.443 1631.094 1905.195 2028.907 2309.816 
##       17       18       19       20 
## 2463.197 2623.209 2788.711 2958.782 
## 
## $df
## [1] 18
## 
## $residual.scale
## [1] 6949.78
# create a plot
plot(x,y,xlab="Age",ylab="Income",main="Linear fitted line with confidence and prediction interval",
     ylim=c(5000,55000))
lines(x,yhat,type="o",pch=20)
# 95% confidence interval
points(new2$x,pred.CI$fit[,"lwr"],col="red",type="o",pch=20) 
points(new2$x,pred.CI$fit[,"upr"],col="red",type="o",pch=20)

# 95% prediction interval
points(new2$x,pred.PI$fit[,"lwr"],col="blue",type="o",pch=20) 
points(new2$x,pred.PI$fit[,"upr"],col="blue",type="o",pch=20)

Assignment Lab1

You must submit:

  1. R file with your codes, and

  2. Answer sheet with your handwriting

On Mango canvas, see the deadline there!

A teacher from Statistics class wants to examine the relationship between final and midterm score of her students. The midterm and final scores of 20 randomly chosen students are shown below

The midterm and final scores of 20 students
Student Midterm Final
1 77 82
2 50 66
3 71 78
4 72 43
5 80 56
6 93 85
7 95 99
8 98 98
9 66 68
10 55 45
11 63 75
12 62 45
13 51 71
14 65 60
15 90 87
16 74 68
17 48 56
18 67 85
19 67 60
20 85 90

Use R language to:

  1. Find sample correlation between midterm and final scores and interpret its value.

  2. Test the population correlation whether midterm and final scores are correlated at significance level 0.05 (Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion).

  3. Find the 95% confidence interval of the population correlation.

  4. Make a scatterplot between midterm and final scores. From the scatterplot, describe the form, direction, and strength of the relationship. Is the relationship approximately linear?

  5. Obtain the least-squares regression line for predicting final score from midterm score.

  6. Find the value of MSres

  7. From the regression line, test whether midterm score can be used to described final score at significance level 0.05. (Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion)

  8. Find the 95% confidence interval of regression parameters

  9. Find \(r^{2}\) and interpret its value for these data

  10. Find 95% confidence and prediction interval of the final score for the midterm score of 75.

  11. Plot linear fitted line and their 95% confidence interval and prediction interval for all values of x.

  12. Write a short paragraph to summarize the results.