Split the dataset into two subsets: one serves as historical data, the other as test data. The variable “xyz” is the sum of the variables x, y and z.

library(ggplot2)
library(car)
str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
diamonds$xyz = with(diamonds, x + y + z)  # sum of the three dimensions
h_diamonds = diamonds[1:40000,]  # historical data
t_diamonds = diamonds[40001:53940,]  # test data (no row shared with the historical data)
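
A positional split like this is easy to get wrong by one row, so here is a minimal sketch of a helper that returns two pieces with no row in common. The name `split_rows` and the toy data frame are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: the first n_hist rows become the historical set,
# the remaining rows become the test set, so no row appears in both.
split_rows <- function(df, n_hist) {
  stopifnot(n_hist >= 1, n_hist < nrow(df))
  list(
    historical = df[seq_len(n_hist), , drop = FALSE],
    test       = df[(n_hist + 1):nrow(df), , drop = FALSE]
  )
}

# Toy stand-in for diamonds with the same x, y, z columns (made-up values).
toy <- data.frame(x = c(3.9, 4.0, 4.2, 4.3, 4.1),
                  y = c(4.0, 4.1, 4.2, 4.4, 4.1),
                  z = c(2.4, 2.3, 2.6, 2.7, 2.5))
toy$xyz <- with(toy, x + y + z)

parts <- split_rows(toy, n_hist = 3)
nrow(parts$historical)  # 3
nrow(parts$test)        # 2
```

Called as `split_rows(diamonds, 40000)`, the same helper would give the historical and test subsets used here.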

Use the historical data to build a prediction model

qplot(xyz, carat, data = h_diamonds)

fit = lm(xyz~carat, data=h_diamonds)
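
Once a model is fitted on the historical rows, the held-out rows can be used to check how well it predicts. The sketch below shows one way to do that with `predict()` and a root-mean-square error. It runs on simulated stand-in data, and the names `sim`, `h_sim`, `t_sim` and `fit_sim` are assumptions made for illustration only.

```r
# Simulated stand-in data (not the diamonds values): xyz roughly linear in carat.
set.seed(1)
sim <- data.frame(carat = runif(200, 0.2, 2))
sim$xyz <- 8 + 6 * sim$carat + rnorm(200, sd = 0.3)

h_sim <- sim[1:150, ]    # plays the role of h_diamonds
t_sim <- sim[151:200, ]  # plays the role of t_diamonds

fit_sim <- lm(xyz ~ carat, data = h_sim)

# Predict the held-out rows and summarise the error as RMSE.
pred <- predict(fit_sim, newdata = t_sim)
rmse <- sqrt(mean((t_sim$xyz - pred)^2))
rmse
```

With the real objects, the equivalent call would be `predict(fit, newdata = t_diamonds)`.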

Q-Q plot of the studentized residuals

qqPlot(fit, main="QQ Plot")
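
To see what goes into a plot of this kind, the studentized residuals can be extracted with base R's `rstudent()` and compared against theoretical quantiles by hand. This is a minimal sketch on simulated data (`sim` and `fit_sim` are assumed names), and it is not meant to reproduce car's plot exactly.

```r
set.seed(2)
sim <- data.frame(carat = runif(100, 0.2, 2))
sim$xyz <- 8 + 6 * sim$carat + rnorm(100, sd = 0.3)
fit_sim <- lm(xyz ~ carat, data = sim)

# Externally studentized residuals: each residual is scaled by an error
# estimate computed with that observation left out.
r_stud <- rstudent(fit_sim)

# Under the model assumptions these follow a t distribution with
# (residual df - 1) degrees of freedom.
theo <- qt(ppoints(length(r_stud)), df = df.residual(fit_sim) - 1)

# Points close to the line y = x suggest the residuals match the reference.
plot(theo, sort(r_stud),
     xlab = "t quantiles", ylab = "Studentized residuals")
abline(0, 1)
```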

Component + residual plot

crPlots(fit)
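
A component + residual (partial residual) value is the ordinary residual plus the fitted contribution of one predictor. The sketch below computes it with base R's `residuals(..., type = "partial")` and again by hand, on simulated data. The names `sim`, `fit_sim`, `pr` and `by_hand` are assumptions for illustration.

```r
set.seed(3)
sim <- data.frame(carat = runif(100, 0.2, 2))
sim$xyz <- 8 + 6 * sim$carat + rnorm(100, sd = 0.3)
fit_sim <- lm(xyz ~ carat, data = sim)

# Partial (component + residual) values for each term, from base R.
pr <- residuals(fit_sim, type = "partial")

# The same quantity by hand: residual plus the centred carat component.
by_hand <- resid(fit_sim) +
  coef(fit_sim)["carat"] * (sim$carat - mean(sim$carat))

all.equal(as.numeric(pr[, "carat"]), as.numeric(by_hand))

# Plotted against the predictor; curvature would hint at a non-linear relationship.
plot(sim$carat, pr[, "carat"],
     xlab = "carat", ylab = "Component + residual")
```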

Leverage plot

leveragePlots(fit)
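
The car documentation describes leverage plots as a generalization of added-variable plots. The sketch below shows only the basic added-variable construction on simulated data, not car's exact scaling. The names `sim`, `fit_sim`, `e_y`, `e_x` and `av_fit` are assumptions for illustration.

```r
set.seed(4)
sim <- data.frame(carat = runif(100, 0.2, 2))
sim$xyz <- 8 + 6 * sim$carat + rnorm(100, sd = 0.3)
fit_sim <- lm(xyz ~ carat, data = sim)

# Remove the effect of the other terms (here only the intercept)
# from both the response and the predictor of interest.
e_y <- resid(lm(xyz ~ 1, data = sim))
e_x <- resid(lm(carat ~ 1, data = sim))

# Regressing one set of residuals on the other recovers the carat slope.
av_fit <- lm(e_y ~ e_x)
all.equal(unname(coef(av_fit)["e_x"]), unname(coef(fit_sim)["carat"]))

plot(e_x, e_y, xlab = "carat | others", ylab = "xyz | others")
abline(av_fit)
```

Points far from the centre along the horizontal axis are the ones with the most pull on the estimated slope.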
