The data come from the R base. type ?Orange to see the description of the data in R.
head(Orange) # looking at the first 6 rows (6 is the default)
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
dim(Orange) # Checking the size (35 rows, and 3 columns)
## [1] 35 3
sum(is.na(Orange)) # Checking if the number of mssing values, if any (it's complete)
## [1] 0
sapply(Orange, class) # class of the variables (Trees is categorical. Others are numeric)
## $Tree
## [1] "ordered" "factor"
##
## $age
## [1] "numeric"
##
## $circumference
## [1] "numeric"
It’s time to check how many per categories we have in Trees
with(Orange, table(Tree)) # there are 1-5 categories, each with 7 frequencies
## Tree
## 3 1 5 2 4
## 7 7 7 7 7
Let’s check the relationshjip between age and circumference
with(Orange, plot(age, circumference), main = "Scatter Plot")
It looks like age is set by categories. But let’s check on that
with(Orange, hist(age))
unique(Orange$age) # we have about 7 different values for age..
## [1] 118 484 664 1004 1231 1372 1582
with(Orange, hist(circumference)) # we really do not have enough data point to generate a nice plot, but we can try
Adding a regression line to see a little better, a little more sophicticated, so we use ggplot2
library(ggplot2)
ggplot(Orange, aes(age, circumference)) + geom_point() +
geom_smooth(method = "lm", se = F) + ggtitle("Scatter Plot")
Let’s try to predict the circumference using age. t is a simple linear regression
my.model <- lm(circumference ~ 1 + age, data = Orange)
summary(my.model)
##
## Call:
## lm(formula = circumference ~ 1 + age, data = Orange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.310 -14.946 -0.076 19.697 45.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.399650 8.622660 2.018 0.0518 .
## age 0.106770 0.008277 12.900 1.93e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared: 0.8345, Adjusted R-squared: 0.8295
## F-statistic: 166.4 on 1 and 33 DF, p-value: 1.931e-14
Interpreting the result: The coefficient of determination is 0.8345. This means 83% of the variation in circumference can be predicted by the variation in age. The intercept is not interpetable inthis situation because a tree with zero age does not have a circumference. We could overcome this limitation by centenring the age variable. We will look at how to do that ina different demonstration. The slope is 0.11. This means the predicted circuference is expected to increase by 0.11 for a unit increase in age. This is significant at the alpha level of 0.05.
Before we can trust this interpretation, the assumnption of inear regression must hold to a reasonable extent. The technique used here assumes that the residuals follow normal distribution witha mean of zero and a constant variance. Lte’s check that!
library(broom)
out <- augment(my.model)
head(out)
## # A tibble: 6 x 9
## circumference age .fitted .se.fit .resid .hat .sigma .cooksd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30 118 30.0 7.77 1.45e-3 0.107 24.1 2.51e-10
## 2 58 484 69.1 5.41 -1.11e+1 0.0519 24.0 6.29e- 3
## 3 87 664 88.3 4.55 -1.30e+0 0.0367 24.1 5.88e- 5
## 4 115 1004 125. 4.07 -9.60e+0 0.0294 24.0 2.55e- 3
## 5 120 1231 149. 4.76 -2.88e+1 0.0402 23.5 3.22e- 2
## 6 142 1372 164. 5.47 -2.19e+1 0.0532 23.8 2.52e- 2
## # … with 1 more variable: .std.resid <dbl>
ggplot(out, aes(.resid)) + geom_histogram() # not too good, but we have only 35 observations. So this is okay. Best is to have much more data for better evaluation.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(out, aes(.fitted, .resid)) + geom_point() +
ylab("Residuals") +
xlab("Predicted values") +
ggtitle("Residuals plot")
Residual plots should not show any pattern. Again, we 35 observations, it is not feasible to evaluate the assumptions well. The hope is that with more data, the assumption will hold. So the result of the analysis must me interpreted with caution at this point, based on the data used.