The first step was to load the data.
cars
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
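The cars dataset ships with base R, so no file import is required. As a minimal sketch using only base R, it can be attached explicitly and previewed with data(), head(), and dim():
# cars is a built-in dataset, so no file needs to be read in
data(cars)
head(cars)   # first six rows of speed and dist
dim(cars)    # 50 rows, 2 columns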
The data was then explored by inspecting its attributes, summary, structure, and column names, as well as the individual columns.
attributes(cars)
## $names
## [1] "speed" "dist"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
names(cars)
## [1] "speed" "dist"
cars$speed
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
cars$dist
## [1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34
## [18] 34 46 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76
## [35] 84 36 46 68 32 48 52 56 64 66 54 70 92 93 120 85
Findings: The attributes showed the column names, the class of the data (a data frame), and the row names. The summary gave vital details about the cars' speed and stopping distance (minimum, quartiles, mean, and maximum). The structure showed that there are 50 observations of 2 numeric variables.
Plots were then created to explore the dataset: histograms of speed and distance, a scatter plot of distance against speed, and normal Q-Q plots of both variables.
hist(cars$speed)
hist(cars$dist)
plot(cars$speed,cars$dist)
qqnorm(cars$speed)
qqnorm(cars$dist)
cars$dist <- as.numeric(cars$dist)
cars$dist
## [1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34
## [18] 34 46 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76
## [35] 84 36 46 68 32 48 52 56 64 66 54 70 92 93 120 85
names(cars) <- c("SPEED", "DISTANCE")
names(cars)
## [1] "SPEED" "DISTANCE"
Findings: From the histograms, the most frequent speeds fell between 10 and 20 and the most frequent distances between 20 and 40. The scatter plot also showed that distance increases as speed increases.
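For reference, the same plots could be redrawn with titles and axis labels using the renamed columns; a minimal sketch with base graphics (the cars data record speed in mph and stopping distance in ft):
# histogram of speed with a descriptive title and axis label
hist(cars$SPEED, main = "Distribution of speed", xlab = "Speed (mph)")
# scatter plot of stopping distance against speed
plot(cars$SPEED, cars$DISTANCE, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", main = "Stopping distance vs speed")
# normal Q-Q plot of distance, with a reference line added for comparison
qqnorm(cars$DISTANCE, main = "Normal Q-Q plot of distance")
qqline(cars$DISTANCE)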
Preprocessing of the data was then carried out to identify any missing values; the data types and column names had already been adjusted as appropriate above.
summary(cars)
## SPEED DISTANCE
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
is.na(cars)
## SPEED DISTANCE
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE
## [6,] FALSE FALSE
## [7,] FALSE FALSE
## [8,] FALSE FALSE
## [9,] FALSE FALSE
## [10,] FALSE FALSE
## [11,] FALSE FALSE
## [12,] FALSE FALSE
## [13,] FALSE FALSE
## [14,] FALSE FALSE
## [15,] FALSE FALSE
## [16,] FALSE FALSE
## [17,] FALSE FALSE
## [18,] FALSE FALSE
## [19,] FALSE FALSE
## [20,] FALSE FALSE
## [21,] FALSE FALSE
## [22,] FALSE FALSE
## [23,] FALSE FALSE
## [24,] FALSE FALSE
## [25,] FALSE FALSE
## [26,] FALSE FALSE
## [27,] FALSE FALSE
## [28,] FALSE FALSE
## [29,] FALSE FALSE
## [30,] FALSE FALSE
## [31,] FALSE FALSE
## [32,] FALSE FALSE
## [33,] FALSE FALSE
## [34,] FALSE FALSE
## [35,] FALSE FALSE
## [36,] FALSE FALSE
## [37,] FALSE FALSE
## [38,] FALSE FALSE
## [39,] FALSE FALSE
## [40,] FALSE FALSE
## [41,] FALSE FALSE
## [42,] FALSE FALSE
## [43,] FALSE FALSE
## [44,] FALSE FALSE
## [45,] FALSE FALSE
## [46,] FALSE FALSE
## [47,] FALSE FALSE
## [48,] FALSE FALSE
## [49,] FALSE FALSE
## [50,] FALSE FALSE
Findings: There were no missing values in the dataset.
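The same check can be summarized more compactly; a brief sketch using the base-R helpers colSums() and anyNA():
# count missing values per column, then check the whole data frame at once
colSums(is.na(cars))
anyNA(cars)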
The dataset was then split into training and test sets. Before splitting, the random seed was fixed with the set.seed() function. The seed is a chosen number that serves as the starting point for the sequence of pseudo-random numbers, which makes the random split reproducible and lets others recreate the same results. A commonly used seed value is 123.
set.seed(123)
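To illustrate the effect of the seed, re-setting it reproduces exactly the same random draw. The sketch below is purely illustrative and finishes by resetting the seed so that the split which follows is unaffected:
# the same seed always yields the same "random" sample
set.seed(123)
sample(1:10, 3)
set.seed(123)
sample(1:10, 3)   # identical to the draw above
set.seed(123)     # reset so the train/test split below is unchanged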
The split prepares the dataset for modelling. A 70/30 split was used here, although an 80/20 split can also be used (see the sketch after the findings below). The two lines of code below calculate the size of each set.
trainSize <- round(nrow(cars) * 0.7)
testSize <- nrow(cars) - trainSize
The number of instances in each set was then confirmed.
trainSize
## [1] 35
testSize
## [1] 15
Findings: The training set contained 35 observations and the test set 15.
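As noted above, an 80/20 split could be computed in exactly the same way; a sketch for comparison only (the trainSize80 and testSize80 names are illustrative and not used later):
# alternative 80/20 split sizes, shown for comparison only
trainSize80 <- round(nrow(cars) * 0.8)   # 40 rows for training
testSize80  <- nrow(cars) - trainSize80  # 10 rows for testing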
The training and test sets were then created by sampling row indices in a randomized order.
training_indices <- sample(seq_len(nrow(cars)), size = trainSize)
trainSet <- cars[training_indices,]
testSet <- cars[-training_indices, ]
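A quick sanity check, sketched below with base-R functions, confirms that the two sets partition the data with no shared rows:
# the training and test sets should cover all 50 rows with no overlap
length(intersect(rownames(trainSet), rownames(testSet)))   # expected: 0
nrow(trainSet) + nrow(testSet) == nrow(cars)               # expected: TRUE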
A linear regression model was then created with the lm() function, modelling DISTANCE as a function of SPEED.
model1 <- lm(DISTANCE ~ SPEED, trainSet)
The fit of the model was then evaluated with summary().
summary(model1)
##
## Call:
## lm(formula = DISTANCE ~ SPEED, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.820 -8.798 -2.272 5.614 44.951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.0481 7.4169 -2.973 0.00548 **
## SPEED 4.0457 0.4589 8.817 3.44e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.08 on 33 degrees of freedom
## Multiple R-squared: 0.702, Adjusted R-squared: 0.693
## F-statistic: 77.73 on 1 and 33 DF, p-value: 3.435e-10
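Based on these coefficients, the fitted relationship is approximately DISTANCE = -22.05 + 4.05 * SPEED, with an adjusted R-squared of about 0.69. As an illustrative sketch, the fitted line could be overlaid on the training data with the base-R functions coef() and abline():
# extract the intercept and slope, then overlay the fitted line on the training data
coef(model1)
plot(trainSet$SPEED, trainSet$DISTANCE, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", main = "Fitted line on the training set")
abline(model1, col = "blue")   # draws the line from model1's coefficients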
Predictions on the test set were then generated with the predict() function.
distance_covered <- predict(model1, testSet)
The predictions were then viewed by printing the distance_covered object.
distance_covered
## 1 2 3 4 5 6 7
## -5.865260 -5.865260 6.271871 6.271871 10.317581 14.363291 18.409001
## 8 9 10 11 12 13 14
## 18.409001 18.409001 22.454712 22.454712 26.500422 26.500422 26.500422
## 15 16 17 18 19 20 21
## 26.500422 30.546132 30.546132 30.546132 30.546132 34.591842 34.591842
## 22 23 24 25 26 27 28
## 34.591842 34.591842 38.637553 38.637553 38.637553 42.683263 42.683263
## 29 30 31 32 33 34 35
## 46.728973 46.728973 46.728973 50.774684 50.774684 50.774684 50.774684
## 36 37 38 39 40 41 42
## 54.820394 54.820394 54.820394 58.866104 58.866104 58.866104 58.866104
## 43 44 45 46 47 48 49
## 58.866104 66.957525 71.003235 75.048945 75.048945 75.048945 75.048945
## 50
## 79.094655
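To quantify how close the predictions are to the observed test-set distances, standard error measures such as RMSE and MAE could be computed. This was not part of the original analysis; the sketch below uses the objects created above, with actual, rmse, and mae as illustrative names:
# compare the predictions with the observed distances in the test set
actual <- testSet$DISTANCE
rmse <- sqrt(mean((actual - distance_covered)^2))   # root mean squared error
mae  <- mean(abs(actual - distance_covered))        # mean absolute error
rmse
mae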