Module: 208251 Regression Analysis and Non-Parametric Statistics

Instructor: Wisunee Puggard

Affiliation: Department of Statistics, Faculty of Science, Chiang Mai University.

Objectives:

Students are able to use R language to

  1. perform descriptive statistics.

  2. construct scatterplot between two quantitative variables.

  3. perform correlation analysis.

  4. perform linear regression analysis and inference on regression parameters.

  5. interpret the results.

Exercise I

1. Import data into R

You can download data file in Class material on MANGO Canvas

  1. Data file is in excel file .xls, we need to change it to be .csv before importing it into R.

  2. Import data into R

Note that your working directory (the place where the data file is at) will be different from mine.

Obs Area Sale.price 1 1 13 115 2 2 4 45 3 3 12 100 4 4 5 50 5 5 6 55 6 6 8 85 7 7 3 40 8 8 4 50 9 9 5 45 10 10 7 70

Mydata=read.csv('/Users/wisuneepuggard/Desktop/ICAS2024/Chapter2_ExerciseI.csv'
,header=TRUE)
Mydata    #call to see data and variables
##    Obs Area Sale.price
## 1    1   13        115
## 2    2    4         45
## 3    3   12        100
## 4    4    5         50
## 5    5    6         55
## 6    6    8         85
## 7    7    3         40
## 8    8    4         50
## 9    9    5         45
## 10  10    7         70

2. Create scatterplot

x = Mydata$Area  #set x (independent variable)
y = Mydata$Sale.price #set y (dependent variable)

plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")

3. Correlation coefficient

Compute the correlation coefficient and hypothesis test for significant linear correlation

cor(x,y)  
## [1] 0.978902
cor.test(x, y,alternative = c("two.sided"),conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 13.55, df = 8, p-value = 8.451e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9103840 0.9951655
## sample estimates:
##      cor 
## 0.978902

4. Fit regression line

fit = lm(y~x) #model fitting
summary(fit) #some basic information about the model is output
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.738 -4.618  0.987  2.269  9.741 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.202      4.120    3.69  0.00613 ** 
## x              7.507      0.554   13.55 8.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.653 on 8 degrees of freedom
## Multiple R-squared:  0.9582, Adjusted R-squared:  0.953 
## F-statistic: 183.6 on 1 and 8 DF,  p-value: 8.451e-07
confint.lm(fit,level=0.95) #to obtain a confidence interval for the coefficient parameters,
##                2.5 %    97.5 %
## (Intercept) 5.701166 24.702293
## x           6.229628  8.784781

5. Plot the fitted line:

yhat = predict(fit) #compute fitted y
yhat 
##         1         2         3         4         5         6         7         8 
## 112.79539  45.23055 105.28818  52.73775  60.24496  75.25937  37.72334  45.23055 
##         9        10 
##  52.73775  67.75216
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
lines(x,yhat,type="o",pch=20)

6. Compute the residuals

res = resid(fit)
res
##          1          2          3          4          5          6          7 
##  2.2046110 -0.2305476 -5.2881844 -2.7377522 -5.2449568  9.7406340  2.2766571 
##          8          9         10 
##  4.7694524 -7.7377522  2.2478386
par(mfrow=c(1,2))  #set plot layout as 1 row 2 column
plot(res)
hist(res)

7. Prediction

with confidence interval and prediction interval at a value of x=13:This returns three values

fit:the point or fitted estimate
lwr: the lower bound of the interval
upr: the upper bound of the interval

new = data.frame(x = 13)
pred = predict(fit, new)
pred
##        1 
## 112.7954
pred.CI = predict(fit, newdata=new, interval = "confidence")
pred.CI 
##        fit      lwr      upr
## 1 112.7954 103.7525 121.8382
pred.PI = predict(fit, newdata=new, interval = "prediction")
pred.PI 
##        fit      lwr    upr
## 1 112.7954 96.93079 128.66

8. Fit the regression line

Fit the regression line and confidence interval and prediction interval for all values of x:

new = data.frame(x = sort(x))
new
##     x
## 1   3
## 2   4
## 3   4
## 4   5
## 5   5
## 6   6
## 7   7
## 8   8
## 9  12
## 10 13
pred.CI = predict(fit, newdata=new, interval = "confidence")
pred.CI
##          fit       lwr       upr
## 1   37.72334  31.45150  43.99519
## 2   45.23055  39.85561  50.60549
## 3   45.23055  39.85561  50.60549
## 4   52.73775  48.07854  57.39697
## 5   52.73775  48.07854  57.39697
## 6   60.24496  56.02702  64.46289
## 7   67.75216  63.61234  71.89198
## 8   75.25937  70.81531  79.70342
## 9  105.28818  97.36103 113.21534
## 10 112.79539 103.75253 121.83825
pred.PI = predict(fit, newdata=new, interval = "prediction")
pred.PI
##          fit      lwr       upr
## 1   37.72334 23.25793  52.18875
## 2   45.23055 31.13083  59.33027
## 3   45.23055 31.13083  59.33027
## 4   52.73775 38.89505  66.58045
## 5   52.73775 38.89505  66.58045
## 6   60.24496 46.54448  73.94543
## 7   67.75216 54.07553  81.42879
## 8   75.25937 61.48759  89.03114
## 9  105.28818 90.03198 120.54439
## 10 112.79539 96.93079 128.65998
# create a plot
plot(x,y,xlab="Area",ylab="Sale price",main="Scatterplot of sale price and area")
lines(x,yhat,type="o",pch=20)
# 95% confidence interval
points(new$x,pred.CI[,"lwr"],col="red",type="o",pch=20) 
points(new$x,pred.CI[,"upr"],col="red",type="o",pch=20)
# 95% prediction interval
points(new$x,pred.PI[,"lwr"],col="blue",type="o",pch=20) 
points(new$x,pred.PI[,"upr"],col="blue",type="o",pch=20)

Exercise II : Job and Life Satisfaction

1. Import data: Data file is provided in MANGO Canvas.

Note that your working directory (the place where the data file is at) will be different from mine.

Mydata2 <- read.csv('/Users/wisuneepuggard/Desktop/ICAS2024/JobSatisfaction.csv'
,header=TRUE)

2. Data exploratory

Mydata2    #To view dataset
##    Obs Age Gender   Status        Education Smoke Income Life.Satisfaction
## 1    1  33   Male  Married        Undergrad    No  15050                 3
## 2    2  27   Male  Married        Undergrad   Yes  20600                 2
## 3    3  22 Female   Single          Master    Yes  15200                 4
## 4    4  34   Male  Married        Undergrad    No  22500                 3
## 5    5  24 Female   Single      High school    No  30400                 4
## 6    6  37 Female Divorced        Undergrad    No  33850                 3
## 7    7  25 Female   Single          Master    Yes  15100                 2
## 8    8  31   Male  Married        Undergrad   Yes  34900                 1
## 9    9  40   Male  Married        Undergrad    No  23200                 1
## 10  10  26   Male   Single Secondary school   Yes  32000                 1
## 11  11  41 Female  Married      High school   Yes  18650                 2
## 12  12  38   Male  Married      High school    No  28570                 2
## 13  13  28 Female   Single        Undergrad    No  26000                 2
## 14  14  44   Male Divorced      High school   Yes  39560                 3
## 15  15  21 Female  Married        Undergrad   Yes  18000                 4
## 16  16  25   Male   Single        Undergrad    No  13500                 2
## 17  17  30 Female   Single Secondary school    No  22350                 1
## 18  18  42   Male  Married        Undergrad   Yes  35000                 3
## 19  19  43 Female  Married        Undergrad    No  28000                 4
## 20  20  20   Male   Single          Master     No  15000                 3
##    Job.Satisfaction
## 1                 3
## 2                 4
## 3                 1
## 4                 4
## 5                 3
## 6                 3
## 7                 3
## 8                 2
## 9                 3
## 10                3
## 11                1
## 12                3
## 13                4
## 14                2
## 15                3
## 16                1
## 17                1
## 18                3
## 19                3
## 20                4
str(Mydata2) #To view type of data
## 'data.frame':    20 obs. of  9 variables:
##  $ Obs              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age              : int  33 27 22 34 24 37 25 31 40 26 ...
##  $ Gender           : chr  "Male" "Male" "Female" "Male" ...
##  $ Status           : chr  "Married" "Married" "Single" "Married" ...
##  $ Education        : chr  "Undergrad" "Undergrad" "Master " "Undergrad" ...
##  $ Smoke            : chr  "No" "Yes" "Yes" "No" ...
##  $ Income           : int  15050 20600 15200 22500 30400 33850 15100 34900 23200 32000 ...
##  $ Life.Satisfaction: int  3 2 4 3 4 3 2 1 1 1 ...
##  $ Job.Satisfaction : int  3 4 1 4 3 3 3 2 3 3 ...
# frequency table of gender
table1 <- table(Mydata2$Gender)
table1
## 
## Female   Male 
##      9     11
prop.table(table1) #compute percentage
## 
## Female   Male 
##   0.45   0.55
# percentages for gender by highest education
table2 <- table(Mydata2$Education, Mydata2$Gender)
table2
##                   
##                    Female Male
##   High school           2    2
##   Master                2    1
##   Secondary school      1    1
##   Undergrad             4    7
prop.table(table2)
##                   
##                    Female Male
##   High school        0.10 0.10
##   Master             0.10 0.05
##   Secondary school   0.05 0.05
##   Undergrad          0.20 0.35
# Descriptive statistics for quantitative data
summary(Mydata2$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   25.00   30.50   31.55   38.50   44.00
summary(Mydata2$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13500   17300   22850   24372   30800   39560
boxplot(Mydata2$Income)

# Descriptive between qualitative and quantitative
boxplot(Mydata2$Income~Mydata2$Gender,xlab="Gender",ylab="Income")

boxplot(Mydata2$Income~Mydata2$Job.Satisfaction,xlab="Job Satisfaction",ylab="Income")

3. Create scatterplot

x = Mydata2$Age      #set x (independent variable)
y = Mydata2$Income   #set y (dependent variable)

plot(x,y,xlab="Age",ylab="Income",main="Scatterplot of age and income")

4. Compute the correlation coefficient

Compute the correlation coefficient and hypothesis test for significant linear correlation:

cor(x,y)  
## [1] 0.5424278
cor.test(x, y,alternative = c("two.sided"),conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 2.7393, df = 18, p-value = 0.01348
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1314628 0.7942906
## sample estimates:
##       cor 
## 0.5424278

5. Fit regression line

fit = lm(y~x) #model fitting
summary(fit) #some basic information about the model is output
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10957  -4321  -1207   5244  10833 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   6893.1     6567.0   1.050   0.3078  
## x              554.0      202.2   2.739   0.0135 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6950 on 18 degrees of freedom
## Multiple R-squared:  0.2942, Adjusted R-squared:  0.255 
## F-statistic: 7.504 on 1 and 18 DF,  p-value: 0.01348
confint.lm(fit) #to obtain a confidence interval for the coefficient estimates,
##                  2.5 %     97.5 %
## (Intercept) -6903.6540 20689.9379
## x             129.1102   978.8681

6. Plot the fitted line

yhat = predict(fit) #compute fitted y
yhat
##        1        2        3        4        5        6        7        8 
## 25174.78 21850.85 19080.90 25728.77 20188.88 27390.74 20742.87 24066.81 
##        9       10       11       12       13       14       15       16 
## 29052.71 21296.86 29606.70 27944.73 22404.84 31268.67 18526.91 20742.87 
##       17       18       19       20 
## 23512.82 30160.69 30714.68 17972.93
plot(x,y,xlab="Age",ylab="Income",main="Linear fitted line")
lines(x,yhat,type="o",pch=20)

7. Compute the residuals

res = resid(fit)
res
##           1           2           3           4           5           6 
## -10124.7843  -1250.8493  -3880.9035  -3228.7734  10211.1182   6459.2591 
##           7           8           9          10          11          12 
##  -5642.8710  10833.1940  -5852.7084  10703.1398 -10956.6976    625.2699 
##          13          14          15          16          17          18 
##   3595.1615   8291.3349   -526.9143  -7242.8710  -1162.8168   4839.3133 
##          19          20 
##  -2714.6759  -2972.9252
par(mfrow=c(1,2))  #set plot layout as 1 row 2 column
plot(res)
hist(res)

8. Prediction

Prediction with confidence interval and prediction interval for all values of x:

new2 = data.frame(x = sort(x))
new2
##     x
## 1  20
## 2  21
## 3  22
## 4  24
## 5  25
## 6  25
## 7  26
## 8  27
## 9  28
## 10 30
## 11 31
## 12 33
## 13 34
## 14 37
## 15 38
## 16 40
## 17 41
## 18 42
## 19 43
## 20 44
pred.CI = predict(fit, newdata=new2, interval = "confidence",se.fit=TRUE)
pred.CI
## $fit
##         fit      lwr      upr
## 1  17972.93 12078.73 23867.12
## 2  18526.91 12981.47 24072.36
## 3  19080.90 13872.89 24288.92
## 4  20188.88 15611.81 24765.95
## 5  20742.87 16452.86 25032.89
## 6  20742.87 16452.86 25032.89
## 7  21296.86 17269.46 25324.26
## 8  21850.85 18056.56 25645.14
## 9  22404.84 18808.39 26001.28
## 10 23512.82 20182.19 26843.45
## 11 24066.81 20793.58 27340.03
## 12 25174.78 21852.30 28497.27
## 13 25728.77 22301.97 29155.57
## 14 27390.74 23388.07 31393.41
## 15 27944.73 23682.15 32207.31
## 16 29052.71 24199.97 33905.45
## 17 29606.70 24431.71 34781.68
## 18 30160.69 24649.53 35671.84
## 19 30714.68 24855.81 36573.54
## 20 31268.67 25052.50 37484.83
## 
## $se.fit
##        1        2        3        4        5        6        7        8 
## 2805.526 2639.529 2478.920 2178.602 2041.968 2041.968 1916.966 1806.012 
##        9       10       11       12       13       14       15       16 
## 1711.841 1585.317 1557.994 1581.443 1631.094 1905.195 2028.907 2309.816 
##       17       18       19       20 
## 2463.197 2623.209 2788.711 2958.782 
## 
## $df
## [1] 18
## 
## $residual.scale
## [1] 6949.78
pred.PI = predict(fit, newdata=new2, interval = "prediction",se.fit=TRUE)
pred.PI
## $fit
##         fit       lwr      upr
## 1  17972.93  2227.160 33718.69
## 2  18526.91  2908.349 34145.48
## 3  19080.90  3578.933 34582.87
## 4  20188.88  4887.339 35490.43
## 5  20742.87  5524.728 35961.01
## 6  20742.87  5524.728 35961.01
## 7  21296.86  6150.654 36443.07
## 8  21850.85  6764.953 36936.75
## 9  22404.84  7367.484 37442.19
## 10 23512.82  8536.812 38488.82
## 11 24066.81  9103.463 39030.15
## 12 25174.78 10200.588 40148.98
## 13 25728.77 10731.087 40726.46
## 14 27390.74 12251.092 42530.39
## 15 27944.73 12734.300 43155.16
## 16 29052.71 13666.457 44438.96
## 17 29606.70 14115.794 45097.60
## 18 30160.69 14554.262 45767.11
## 19 30714.68 14982.101 46447.25
## 20 31268.67 15399.563 47137.77
## 
## $se.fit
##        1        2        3        4        5        6        7        8 
## 2805.526 2639.529 2478.920 2178.602 2041.968 2041.968 1916.966 1806.012 
##        9       10       11       12       13       14       15       16 
## 1711.841 1585.317 1557.994 1581.443 1631.094 1905.195 2028.907 2309.816 
##       17       18       19       20 
## 2463.197 2623.209 2788.711 2958.782 
## 
## $df
## [1] 18
## 
## $residual.scale
## [1] 6949.78
# create a plot
plot(x,y,xlab="Age",ylab="Income",main="Linear fitted line with confidence and prediction interval",
     ylim=c(5000,55000))
lines(x,yhat,type="o",pch=20)
# 95% confidence interval
points(new2$x,pred.CI$fit[,"lwr"],col="red",type="o",pch=20) 
points(new2$x,pred.CI$fit[,"upr"],col="red",type="o",pch=20)

# 95% prediction interval
points(new2$x,pred.PI$fit[,"lwr"],col="blue",type="o",pch=20) 
points(new2$x,pred.PI$fit[,"upr"],col="blue",type="o",pch=20)

Assignment Lab1

You must submit:

  1. R file with your codes, and

  2. Answer sheet with your handwriting

On Mango canvas, see the deadline there!

A teacher from Statistics class wants to examine the relationship between final and midterm score of her students. The midterm and final scores of 20 randomly chosen students are shown below

The midterm and final scores of 20 students
Student Midterm Final
1 77 82
2 50 66
3 71 78
4 72 43
5 80 56
6 93 85
7 95 99
8 98 98
9 66 68
10 55 45
11 63 75
12 62 45
13 51 71
14 65 60
15 90 87
16 74 68
17 48 56
18 67 85
19 67 60
20 85 90

Use R language to:

  1. Find sample correlation between midterm and final scores and interpret its value.

  2. Test the population correlation whether midterm and final scores are correlated at significance level 0.05 (Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion).

  3. Find the 95% confidence interval of the population correlation.

  4. Make a scatterplot between midterm and final scores. From the scatterplot, describe the form, direction, and strength of the relationship. Is the relationship approximately linear?

  5. Obtain the least-squares regression line for predicting final score from midterm score.

  6. Find the value of MSres

  7. From the regression line, test whether midterm score can be used to described final score at significance level 0.05. (Perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion)

  8. Find the 95% confidence interval of regression parameters

  9. Find 𝑟2 and interpret its value for these data

  10. Find 95% confidence and prediction interval of the final score for the midterm score of 75.

  11. Plot linear fitted line and their 95% confidence interval and prediction interval for all values of x.

  12. Write a short paragraph to summarize the results.