Week 1 Discussion - Corinne Willis

Load Data (1 of 2)

The ShipAccidents data frame contains 40 observations of 5 ship types from 4 construction periods and 2 operation periods.

The data tracks each ship’s aggregate months of service and number of damage indicents.

ShipAccidents is time-series data as it consists of a sample of observations over successive periods of time.

Variable Definitions

  • type - factor with levels "A" to "E" for the different ship types.

  • construction - factor with levels "1960-64", "1965-69", "1970-74", "1975-79" for the periods of construction.

  • operation - factor with levels "1960-74", "1975-79" for the periods of operation.

  • service - a numeric vector of aggregate months of service.

  • incidents - a numeric vector of number of damage incidents.

install.packages("AER", repos='https://lib.stat.cmu.edu/R/CRAN/')

The downloaded binary packages are in
    /var/folders/lt/xj2_66qx14dft54l6h1tnswc0000gn/T//RtmppC5S84/downloaded_packages
library("AER")
Loading required package: car
Loading required package: carData
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("ShipAccidents")
help("ShipAccidents", package = "AER")
head(ShipAccidents)
  type construction operation service incidents
1    A      1960-64   1960-74     127         0
2    A      1960-64   1975-79      63         0
3    A      1965-69   1960-74    1095         3
4    A      1965-69   1975-79    1095         4
5    A      1970-74   1960-74    1512         6
6    A      1970-74   1975-79    3353        18
tail(ShipAccidents)
   type construction operation service incidents
35    E      1965-69   1960-74     789         7
36    E      1965-69   1975-79     437         7
37    E      1970-74   1960-74    1157         5
38    E      1970-74   1975-79    2161        12
39    E      1975-79   1960-74       0         0
40    E      1975-79   1975-79     542         1

Exploratory Data Analysis (EDA)

str(ShipAccidents)
'data.frame':   40 obs. of  5 variables:
 $ type        : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ construction: Factor w/ 4 levels "1960-64","1965-69",..: 1 1 2 2 3 3 4 4 1 1 ...
 $ operation   : Factor w/ 2 levels "1960-74","1975-79": 1 2 1 2 1 2 1 2 1 2 ...
 $ service     : num  127 63 1095 1095 1512 ...
 $ incidents   : num  0 0 3 4 6 18 0 11 39 29 ...
require(psych)
Loading required package: psych

Attaching package: 'psych'
The following object is masked from 'package:car':

    logit
describe(ShipAccidents)
              vars  n    mean      sd median trimmed     mad min   max range
type*            1 40    3.00    1.43      3    3.00    1.48   1     5     4
construction*    2 40    2.58    1.13      3    2.59    1.48   1     4     3
operation*       3 40    1.52    0.51      2    1.53    0.00   1     2     1
service          4 40 4089.35 9040.32    782 1641.78 1074.14   0 44882 44882
incidents        5 40    8.90   14.96      2    5.06    2.97   0    58    58
               skew kurtosis      se
type*          0.00    -1.38    0.23
construction* -0.08    -1.43    0.18
operation*    -0.10    -2.04    0.08
service        3.00     9.13 1429.40
incidents      2.02     3.10    2.37
summary(ShipAccidents)
 type   construction   operation     service          incidents   
 A:8   1960-64: 9    1960-74:19   Min.   :    0.0   Min.   : 0.0  
 B:8   1965-69:10    1975-79:21   1st Qu.:  175.8   1st Qu.: 0.0  
 C:8   1970-74:10                 Median :  782.0   Median : 2.0  
 D:8   1975-79:11                 Mean   : 4089.3   Mean   : 8.9  
 E:8                              3rd Qu.: 2078.5   3rd Qu.:11.0  
                                  Max.   :44882.0   Max.   :58.0  
library(ggplot2)

Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':

    %+%, alpha
ggplot(ShipAccidents, aes(x = incidents,y = construction)) +
geom_line() +
labs(title = "Incidents by Construction Period")

library(ggplot2)
ggplot(ShipAccidents, aes(x = incidents,y = operation)) +
geom_line() +
labs(title = "Incidents by Operation Period")

plot(ShipAccidents$incidents, ShipAccidents$service, main="Incidents by Months of Service",
   xlab="# of Damage Incidents", ylab="Aggregate Months of Service", pch=19)

Load Data (2 of 2)

The Orange data frame contains 35 rows and 3 columns of records of the growth of orange trees.

The data tracks each tree’s age and circumference.

Orange is cross-sectional data as it consists of a sample of observations on individual units taken at a single point in time.

Variable Definitions

  • tree - ordered factor indicating the tree on which the measurement is made. The ordering is according to increasing maximum diameter.

  • age - numeric vector giving the age of the tree (days since 1968/12/31)

  • circumference - numeric vector of trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.

?datasets()
library(help = "datasets")
Orange_Data <- Orange
help("Orange")
head(Orange_Data)
Grouped Data: circumference ~ age | Tree
  Tree  age circumference
1    1  118            30
2    1  484            58
3    1  664            87
4    1 1004           115
5    1 1231           120
6    1 1372           142
tail(Orange_Data)
Grouped Data: circumference ~ age | Tree
   Tree  age circumference
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177

Exploratory Data Analysis (EDA)

str(Orange_Data)
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  35 obs. of  3 variables:
 $ Tree         : Ord.factor w/ 5 levels "3"<"1"<"5"<"2"<..: 2 2 2 2 2 2 2 4 4 4 ...
 $ age          : num  118 484 664 1004 1231 ...
 $ circumference: num  30 58 87 115 120 142 145 33 69 111 ...
 - attr(*, "formula")=Class 'formula'  language circumference ~ age | Tree
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time since December 31, 1968"
  ..$ y: chr "Trunk circumference"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(mm)"
require(psych)
describe(Orange_Data)
              vars  n   mean     sd median trimmed    mad min  max range  skew
Tree*            1 35   3.00   1.43      3    3.00   1.48   1    5     4  0.00
age              2 35 922.14 491.86   1004  937.07 545.60 118 1582  1464 -0.26
circumference    3 35 115.86  57.49    115  115.14  77.10  30  214   184  0.00
              kurtosis    se
Tree*            -1.40  0.24
age              -1.29 83.14
circumference    -1.24  9.72
summary(Orange_Data)
 Tree       age         circumference  
 3:7   Min.   : 118.0   Min.   : 30.0  
 1:7   1st Qu.: 484.0   1st Qu.: 65.5  
 5:7   Median :1004.0   Median :115.0  
 2:7   Mean   : 922.1   Mean   :115.9  
 4:7   3rd Qu.:1372.0   3rd Qu.:161.5  
       Max.   :1582.0   Max.   :214.0  
plot(Orange_Data$age, Orange_Data$circumference, main="Circumference of Orange Tree by Age",
   xlab="Age of Orange Tree", ylab="Circumference", pch=19)

Linear Regression

We will estimate the following linear regression with Orange Tree circumference as the dependent variable and Orange Tree Age as the independent variable.

\[ Circumference_i = \beta_0 + \beta_1 Age_i + \epsilon_i \]

mylm <- lm(data = Orange_Data,                
   formula =  circumference ~ age)
mylm

Call:
lm(formula = circumference ~ age, data = Orange_Data)

Coefficients:
(Intercept)          age  
    17.3997       0.1068  

\(\beta_1\) is 0.1068 in the chart above.

summary(mylm)

Call:
lm(formula = circumference ~ age, data = Orange_Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.310 -14.946  -0.076  19.697  45.111 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.399650   8.622660   2.018   0.0518 .  
age          0.106770   0.008277  12.900 1.93e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.74 on 33 degrees of freedom
Multiple R-squared:  0.8345,    Adjusted R-squared:  0.8295 
F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Covariance-Variance formula for slope

beta1 <- cov(Orange_Data$circumference, Orange_Data$age)/var(Orange_Data$age)
round(beta1, digits = 4)
[1] 0.1068
# Plot

?plot
Help on topic 'plot' was found in the following packages:

  Package               Library
  graphics              /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
  base                  /Library/Frameworks/R.framework/Resources/library


Using the first match ...
plot(x = Orange_Data$age, 
     y = Orange_Data$circumference,
     xlab = "Orange Tree Age",
     ylab = "Orange Tree Circumference",
     main = "Best Fit Line",
     sub = "",
     bg  = "lightblue",   # a vector of background colors (Graphical Parameters)
     col = "black",       # the colors for lines and points (Graphical Parameters)
     cex = .9,            # a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default = 1 (Graphical Parameters)
     pch = 21,            # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square,...}
     frame = TRUE         # frame.plot    - a logical indicating whether a box should be drawn around the plot.
     )


?abline
abline(mylm,
       lwd = 2,                    # line width, default = 1
       col = "blue"
       )

Finding the Intercept

We know that

\[ \begin{align*} y_i & = \beta_0 + \beta_1 x_i \\ <=> \bar{y} & = \beta_0 + \beta_1 \bar{x} \\ \end{align*} \]

Rearrange the terms and solve for \(\beta_0\):

\[ \beta_0 = \bar{y}-\bar{x}*\beta_1 \]

beta0 <- mean(Orange_Data$circumference) - beta1 * mean(Orange_Data$age)
round(beta0, digits = 4)    # print the value of beta_0
[1] 17.3997
# beta_0 value should match the intercept value from the lm command
mylm

Call:
lm(formula = circumference ~ age, data = Orange_Data)

Coefficients:
(Intercept)          age  
    17.3997       0.1068  

Plotting the data and best fit line

?plot
Help on topic 'plot' was found in the following packages:

  Package               Library
  graphics              /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
  base                  /Library/Frameworks/R.framework/Resources/library


Using the first match ...
plot(x = Orange_Data$age, 
     y = Orange_Data$circumference,
     xlab = "Orange Tree Age",
     ylab = "Orange Tree Circumference",
     main = "Best Fit Line",
     sub = "",
     bg  = "lightblue",   # a vector of background colors (Graphical Parameters)
     col = "black",       # the colors for lines and points (Graphical Parameters)
     cex = .9,            # a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default = 1 (Graphical Parameters)
     pch = 21,            # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square,...}
     frame = TRUE         # frame.plot    - a logical indicating whether a box should be drawn around the plot.
     )


?abline
abline(mylm,
       lwd = 2,                    # line width, default = 1
       col = "blue"
       )

Conclusion

From the results derived from the both the lm() function and the mathematical equation, we can see that \(\beta_0\) and \(\beta_1\) are exactly the same when derived either way. Below are some important distinctions -

  • Covariance - Covariance is a measure of joint variability meaning that it shows how two variables (x, y) change with respect to each other. The covariance for two variables is calculated as the average of the product of the differences from the mean.

  • Variance - Variance is a measure of dispursion meaning that it shows how spread out the data (x variables) are from the average value or the mean. The variance of a single variable is calculated as the average of the squared differences from the mean.

  • When you divide the covariance of y and x by the variance of x, you are scaling the covariance by the variability of x. This results in the average change in y with respect to a unit change in x, which is exactly what the slope coefficient \(\beta_1\) estimates in a simple linear regression model.

References

Waples, J. (2023, February 2). Variance, Covariance, Standard Deviation, Correlation and Regression in R using mtcars. https://medium.com/@josef.waples/variance-covariance-standard-deviation-correlation-and-regression-in-r-rstudio-using-mtcars-1507a17e4fbc