Data Exploration and Simple Prediction

The data come from base R (the datasets package). Type ?Orange to see the description of the data.

head(Orange) # looking at the first 6 rows (6 is the default)

##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142

dim(Orange) # Checking the size (35 rows, and 3 columns)

## [1] 35  3

sum(is.na(Orange)) # Counting the missing values, if any (the data are complete)

## [1] 0
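
A single total can hide which column any gaps sit in. As a small sketch using only base R, the per-column counts and the number of fully observed rows can be obtained like this:

```r
# Count missing values per column rather than in total
na_per_column <- colSums(is.na(Orange))
print(na_per_column)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(Orange))
print(n_complete)
```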

sapply(Orange, class) # class of the variables (Tree is an ordered factor; the others are numeric)

## $Tree
## [1] "ordered" "factor" 
## 
## $age
## [1] "numeric"
## 
## $circumference
## [1] "numeric"

Next, let’s check how many observations each category of Tree has

with(Orange, table(Tree)) # there are 5 categories (trees 1 to 5), each with 7 observations

## Tree
## 3 1 5 2 4 
## 7 7 7 7 7
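
The levels print as 3 1 5 2 4 rather than 1 to 5 because Tree is an ordered factor; according to ?Orange, the levels are ordered by increasing maximum trunk size. A quick sketch, using base R only, to see that ordering in the data:

```r
# Levels of the ordered factor, in their stored order
tree_levels <- levels(Orange$Tree)
print(tree_levels)

# Largest circumference recorded for each tree, reported in level order
max_circ <- with(Orange, tapply(circumference, Tree, max))
print(max_circ)
```

If the maxima increase from left to right, the level order matches the description in the help page.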

Let’s check the relationship between age and circumference

with(Orange, plot(age, circumference, main = "Scatter Plot"))

It looks like age takes only a few distinct values. Let’s check that

with(Orange, hist(age))

unique(Orange$age) # we have exactly 7 distinct values for age

## [1]  118  484  664 1004 1231 1372 1582
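
Seven distinct ages across 35 rows suggest that every tree was measured on the same seven occasions. A cross-tabulation, sketched here with base R, makes that design visible:

```r
# Rows are trees, columns are ages; each cell counts the measurements
design <- with(Orange, table(Tree, age))
print(design)

# TRUE if every tree was measured exactly once at every age
balanced <- all(design == 1)
print(balanced)
```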

with(Orange, hist(circumference)) # we really do not have enough data points to generate a nice plot, but we can try

Adding a regression line makes the trend easier to see. This is a little more sophisticated, so we use ggplot2

library(ggplot2)
ggplot(Orange, aes(age, circumference)) + geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + ggtitle("Scatter Plot")

Let’s try to predict the circumference using age. It is a simple linear regression

my.model <- lm(circumference ~ 1 + age, data = Orange)
summary(my.model)

## 
## Call:
## lm(formula = circumference ~ 1 + age, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.310 -14.946  -0.076  19.697  45.111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.399650   8.622660   2.018   0.0518 .  
## age          0.106770   0.008277  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.74 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Interpreting the result: The coefficient of determination is 0.8345, which means that about 83% of the variation in circumference is explained by age. The intercept is not interpretable in this situation because a tree of age zero does not have a trunk circumference. We could overcome this limitation by centering the age variable; we will look at how to do that in a different demonstration. The slope is 0.11. This means the predicted circumference is expected to increase by about 0.11 mm for each additional day of age (per ?Orange, age is recorded in days and circumference in millimetres). The slope is statistically significant at the 0.05 level (p = 1.93e-14).
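
To turn the fitted line into an actual prediction, predict() can be applied to new ages. The sketch below refits the same model so that it runs on its own; the ages 500 and 1000 days are arbitrary illustrative values, not part of the original analysis.

```r
# Refit the model from the text so this block is self-contained
my.model <- lm(circumference ~ 1 + age, data = Orange)

# Hypothetical ages (in days) chosen only for illustration
new_trees <- data.frame(age = c(500, 1000))

# Point predictions with 95% prediction intervals
pred <- predict(my.model, newdata = new_trees, interval = "prediction")
print(pred)

# 95% confidence intervals for the intercept and the slope
ci <- confint(my.model)
print(ci)
```

The prediction interval describes where a single new measurement is expected to fall, so it is wider than the confidence interval for the fitted mean would be. Both ages lie inside the observed range (118 to 1582 days); predictions outside that range would be extrapolation.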

Before we can trust this interpretation, the assumptions of linear regression must hold to a reasonable extent. The technique used here assumes that the residuals follow a normal distribution with a mean of zero and a constant variance. Let’s check that!

library(broom)
out <- augment(my.model)
head(out)

## # A tibble: 6 x 9
##   circumference   age .fitted .se.fit   .resid   .hat .sigma  .cooksd
##           <dbl> <dbl>   <dbl>   <dbl>    <dbl>  <dbl>  <dbl>    <dbl>
## 1            30   118    30.0    7.77  1.45e-3 0.107    24.1 2.51e-10
## 2            58   484    69.1    5.41 -1.11e+1 0.0519   24.0 6.29e- 3
## 3            87   664    88.3    4.55 -1.30e+0 0.0367   24.1 5.88e- 5
## 4           115  1004   125.     4.07 -9.60e+0 0.0294   24.0 2.55e- 3
## 5           120  1231   149.     4.76 -2.88e+1 0.0402   23.5 3.22e- 2
## 6           142  1372   164.     5.47 -2.19e+1 0.0532   23.8 2.52e- 2
## # … with 1 more variable: .std.resid <dbl>

ggplot(out, aes(.resid)) + geom_histogram() # not too good, but we have only 35 observations, so this is okay. Much more data would allow a better evaluation.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(out, aes(.fitted, .resid)) + geom_point() +
  ylab("Residuals") +
  xlab("Predicted values") +
  ggtitle("Residuals plot")

Residual plots should not show any pattern. Again, with only 35 observations, it is not feasible to evaluate the assumptions well. The hope is that with more data the assumptions will hold. So the results of the analysis must be interpreted with caution at this point, given the data used.
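
With so few points a histogram is hard to read, so a normal Q-Q plot and a Shapiro-Wilk test are common complements. The following is a sketch using base R only and is not part of the original analysis; it refits the model so that it can run on its own.

```r
# Refit the model from the text so this block is self-contained
my.model <- lm(circumference ~ 1 + age, data = Orange)
res <- residuals(my.model)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(res)
print(sw)

# Residuals of a least-squares fit with an intercept average to zero
print(mean(res))
```

A small p-value would point to non-normal residuals, but with 35 observations the test has limited power, so it is best read alongside the plot rather than on its own.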

J Mess