head(ShipAccidents)
  type construction operation service incidents
1    A      1960-64   1960-74     127         0
2    A      1960-64   1975-79      63         0
3    A      1965-69   1960-74    1095         3
4    A      1965-69   1975-79    1095         4
5    A      1970-74   1960-74    1512         6
6    A      1970-74   1975-79    3353        18
tail(ShipAccidents)
   type construction operation service incidents
35    E      1965-69   1960-74     789         7
36    E      1965-69   1975-79     437         7
37    E      1970-74   1960-74    1157         5
38    E      1970-74   1975-79    2161        12
39    E      1975-79   1960-74       0         0
40    E      1975-79   1975-79     542         1
plot(ShipAccidents$incidents, ShipAccidents$service,
     main = "Incidents by Months of Service",
     xlab = "# of Damage Incidents",
     ylab = "Aggregate Months of Service",
     pch = 19)
Load Data (2 of 2)
The Orange data frame contains 35 rows and 3 columns of records of the growth of orange trees.
The data track each tree's age and trunk circumference.
Orange is longitudinal (repeated-measures) data: the same five trees are measured at several ages, rather than a sample of units observed once at a single point in time.
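The code that creates the Orange_Data object is not shown here; a minimal sketch, assuming Orange_Data is simply an alias for the built-in Orange data set that ships with base R (the Orange_Data name is taken from the calls below; the alias itself is this sketch's assumption):

data(Orange)             # built-in growth data for five orange trees (datasets package)
Orange_Data <- Orange    # assumed alias used throughout this document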
Variable Definitions
Tree - ordered factor indicating the tree on which the measurement is made. The ordering is according to increasing maximum diameter.
age - numeric vector giving the age of the tree (days since 1968/12/31)
circumference - numeric vector of trunk circumferences (mm). This is probably “circumference at breast height”, a standard measurement in forestry.
head(Orange_Data)
Grouped Data: circumference ~ age | Tree
  Tree  age circumference
1    1  118            30
2    1  484            58
3    1  664            87
4    1 1004           115
5    1 1231           120
6    1 1372           142
tail(Orange_Data)
Grouped Data: circumference ~ age | Tree
   Tree  age circumference
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177
Exploratory Data Analysis (EDA)
str(Orange_Data)
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 35 obs. of 3 variables:
$ Tree : Ord.factor w/ 5 levels "3"<"1"<"5"<"2"<..: 2 2 2 2 2 2 2 4 4 4 ...
$ age : num 118 484 664 1004 1231 ...
$ circumference: num 30 58 87 115 120 142 145 33 69 111 ...
- attr(*, "formula")=Class 'formula' language circumference ~ age | Tree
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Time since December 31, 1968"
..$ y: chr "Trunk circumference"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(mm)"
require(psych)
describe(Orange_Data)
              vars  n   mean     sd median trimmed    mad min  max range  skew
Tree*            1 35   3.00   1.43      3    3.00   1.48   1    5     4  0.00
age              2 35 922.14 491.86   1004  937.07 545.60 118 1582  1464 -0.26
circumference    3 35 115.86  57.49    115  115.14  77.10  30  214   184  0.00
              kurtosis    se
Tree*            -1.40  0.24
age              -1.29 83.14
circumference    -1.24  9.72
summary(Orange_Data)
 Tree       age         circumference
 3:7   Min.   : 118.0   Min.   : 30.0
 1:7   1st Qu.: 484.0   1st Qu.: 65.5
 5:7   Median :1004.0   Median :115.0
 2:7   Mean   : 922.1   Mean   :115.9
 4:7   3rd Qu.:1372.0   3rd Qu.:161.5
       Max.   :1582.0   Max.   :214.0
plot(Orange_Data$age, Orange_Data$circumference,
     main = "Circumference of Orange Tree by Age",
     xlab = "Age of Orange Tree",
     ylab = "Circumference",
     pch = 19)
Linear Regression
We will estimate the following linear regression with Orange Tree circumference as the dependent variable and Orange Tree Age as the independent variable.
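Written out, the model being estimated is (a standard restatement of the sentence above, with \(i\) indexing the 35 observations and \(\varepsilon_i\) the error term):

\[ \text{circumference}_i = \beta_0 + \beta_1 \, \text{age}_i + \varepsilon_i \]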
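The chunk that fits the model and derives the slope by hand does not appear above, yet the lines that follow refer to objects named beta1 and mylm. A minimal sketch consistent with those references and with the lm() call echoed below (the cov()/var() construction mirrors the Conclusion; the exact original code is an assumption):

# Fit the simple linear regression of circumference on age
mylm <- lm(circumference ~ age, data = Orange_Data)

# Slope by hand: covariance of age and circumference scaled by the variance of age
beta1 <- cov(Orange_Data$age, Orange_Data$circumference) / var(Orange_Data$age)
round(beta1, digits = 4)  # should match the age coefficient reported by mylm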
beta0 <- mean(Orange_Data$circumference) - beta1 * mean(Orange_Data$age)
round(beta0, digits = 4)  # print the value of beta_0
[1] 17.3997
# beta_0 value should match the intercept value from the lm command
mylm
Call:
lm(formula = circumference ~ age, data = Orange_Data)
Coefficients:
(Intercept)          age
    17.3997       0.1068
Plotting the data and best fit line
?plot
plot(x = Orange_Data$age, y = Orange_Data$circumference,
     xlab = "Orange Tree Age",
     ylab = "Orange Tree Circumference",
     main = "Best Fit Line",
     sub = "",
     bg = "lightblue",  # a vector of background colors (Graphical Parameters)
     col = "black",     # the colors for lines and points (Graphical Parameters)
     cex = 0.9,         # amount by which plotting characters and symbols are scaled relative to the default of 1 (Graphical Parameters)
     pch = 21,          # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square, ...}
     frame = TRUE)      # frame.plot - a logical indicating whether a box should be drawn around the plot

?abline
abline(mylm,
       lwd = 2,         # line width, default = 1
       col = "blue")
Conclusion
From the results derived from both the lm() function and the mathematical equations, we can see that \(\beta_0\) and \(\beta_1\) are exactly the same when derived either way. Below are some important distinctions:
Covariance - Covariance is a measure of joint variability, meaning that it shows how two variables (x and y) change with respect to each other. The covariance of two variables is calculated as the average of the products of their differences from their respective means.
Variance - Variance is a measure of dispersion, meaning that it shows how spread out the data (the x values) are from their average value, or mean. The variance of a single variable is calculated as the average of the squared differences from the mean.
When you divide the covariance of y and x by the variance of x, you are scaling the covariance by the variability of x. This results in the average change in y with respect to a unit change in x, which is exactly what the slope coefficient \(\beta_1\) estimates in a simple linear regression model.
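In symbols, the quantities described above combine as follows (a standard restatement; whether the covariance and variance are averaged over \(n\) or \(n-1\), the factor cancels in the ratio):

\[ \hat{\beta}_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x} \]

As a quick numerical check (mylm and beta1 come from the sketch above, beta0 from the code above):

coef(mylm)        # intercept and slope from lm()
c(beta0, beta1)   # hand-derived values; the two pairs should agree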