The first step was to load the data.
cars
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
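The cars dataset ships with base R, so no file import is required. As a minimal sketch using only base R, it can be attached explicitly and previewed with data(), head(), and dim():
# cars is a built-in dataset, so no file needs to be read in
data(cars)
head(cars)   # first six rows of speed and dist
dim(cars)    # 50 rows, 2 columns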
The data was then explored by inspecting its attributes, summary, structure, and column names, as well as the individual columns.
attributes(cars)
## $names
## [1] "speed" "dist"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
names(cars)
## [1] "speed" "dist"
cars$speed
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
cars$dist
## [1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34
## [18] 34 46 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76
## [35] 84 36 46 68 32 48 52 56 64 66 54 70 92 93 120 85
Findings: The attributes showed the column names, the class of the data (a data frame), and the row names. The summary gave vital details about the cars' speed and stopping distance (minimum, quartiles, mean, and maximum). The structure showed that there are 50 observations of 2 numeric variables.
Plots were then created to explore the dataset: histograms of speed and distance, a scatter plot of distance against speed, and normal Q-Q plots of both variables.
hist(cars$speed)
hist(cars$dist)
plot(cars$speed,cars$dist)
qqnorm(cars$speed)
qqnorm(cars$dist)
cars$dist <- as.numeric(cars$dist)
cars$dist
## [1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34
## [18] 34 46 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76
## [35] 84 36 46 68 32 48 52 56 64 66 54 70 92 93 120 85
names(cars) <- c("SPEED", "DISTANCE")
names(cars)
## [1] "SPEED" "DISTANCE"
Findings: From the histograms, the most frequent speeds fell between 10 and 20 and the most frequent distances between 20 and 40. The scatter plot also showed that distance increases as speed increases.
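For reference, the same plots could be redrawn with titles and axis labels using the renamed columns; a minimal sketch with base graphics (the cars data record speed in mph and stopping distance in ft):
# histogram of speed with a descriptive title and axis label
hist(cars$SPEED, main = "Distribution of speed", xlab = "Speed (mph)")
# scatter plot of stopping distance against speed
plot(cars$SPEED, cars$DISTANCE, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", main = "Stopping distance vs speed")
# normal Q-Q plot of distance, with a reference line added for comparison
qqnorm(cars$DISTANCE, main = "Normal Q-Q plot of distance")
qqline(cars$DISTANCE)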
Preprocessing of the data was then carried out to identify any missing values; the data types and column names had already been adjusted as appropriate above.
summary(cars)
## SPEED DISTANCE
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
is.na(cars)
## SPEED DISTANCE
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE
## [6,] FALSE FALSE
## [7,] FALSE FALSE
## [8,] FALSE FALSE
## [9,] FALSE FALSE
## [10,] FALSE FALSE
## [11,] FALSE FALSE
## [12,] FALSE FALSE
## [13,] FALSE FALSE
## [14,] FALSE FALSE
## [15,] FALSE FALSE
## [16,] FALSE FALSE
## [17,] FALSE FALSE
## [18,] FALSE FALSE
## [19,] FALSE FALSE
## [20,] FALSE FALSE
## [21,] FALSE FALSE
## [22,] FALSE FALSE
## [23,] FALSE FALSE
## [24,] FALSE FALSE
## [25,] FALSE FALSE
## [26,] FALSE FALSE
## [27,] FALSE FALSE
## [28,] FALSE FALSE
## [29,] FALSE FALSE
## [30,] FALSE FALSE
## [31,] FALSE FALSE
## [32,] FALSE FALSE
## [33,] FALSE FALSE
## [34,] FALSE FALSE
## [35,] FALSE FALSE
## [36,] FALSE FALSE
## [37,] FALSE FALSE
## [38,] FALSE FALSE
## [39,] FALSE FALSE
## [40,] FALSE FALSE
## [41,] FALSE FALSE
## [42,] FALSE FALSE
## [43,] FALSE FALSE
## [44,] FALSE FALSE
## [45,] FALSE FALSE
## [46,] FALSE FALSE
## [47,] FALSE FALSE
## [48,] FALSE FALSE
## [49,] FALSE FALSE
## [50,] FALSE FALSE
Findings: There were no missing values in the dataset.
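The same check can be summarized more compactly; a brief sketch using the base-R helpers colSums() and anyNA():
# count missing values per column, then check the whole data frame at once
colSums(is.na(cars))
anyNA(cars)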
The dataset was then split into training and test sets. Before splitting, the random seed was fixed with the set.seed() function. The seed is a chosen number that serves as the starting point for the sequence of pseudo-random numbers, which makes the random split reproducible and lets others recreate the same results. A commonly used seed value is 123.
set.seed(123)
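To illustrate the effect of the seed, re-setting it reproduces exactly the same random draw. The sketch below is purely illustrative and finishes by resetting the seed so that the split which follows is unaffected:
# the same seed always yields the same "random" sample
set.seed(123)
sample(1:10, 3)
set.seed(123)
sample(1:10, 3)   # identical to the draw above
set.seed(123)     # reset so the train/test split below is unchanged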
The split prepares the dataset for modelling. A 70/30 split was used here, although an 80/20 split can also be used (see the sketch after the findings below). The two lines of code below calculate the size of each set.
trainSize <- round(nrow(cars) * 0.7)
testSize <- nrow(cars) - trainSize
The number of instances in each set was then confirmed.
trainSize
## [1] 35
testSize
## [1] 15
Findings: The training set contained 35 observations and the test set 15.
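As noted above, an 80/20 split could be computed in exactly the same way; a sketch for comparison only (the trainSize80 and testSize80 names are illustrative and not used later):
# alternative 80/20 split sizes, shown for comparison only
trainSize80 <- round(nrow(cars) * 0.8)   # 40 rows for training
testSize80  <- nrow(cars) - trainSize80  # 10 rows for testing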
The training and test sets were then created by sampling row indices in a randomized order.
training_indices <- sample(seq_len(nrow(cars)), size = trainSize)
trainSet <- cars[training_indices,]
testSet <- cars[-training_indices, ]
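A quick sanity check, sketched below with base-R functions, confirms that the two sets partition the data with no shared rows:
# the training and test sets should cover all 50 rows with no overlap
length(intersect(rownames(trainSet), rownames(testSet)))   # expected: 0
nrow(trainSet) + nrow(testSet) == nrow(cars)               # expected: TRUE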
A linear regression model was then created with the lm() function, modelling DISTANCE as a function of SPEED.
model1 <- lm(DISTANCE ~ SPEED, trainSet)
The fit of the model was then evaluated with summary().
summary(model1)
##
## Call:
## lm(formula = DISTANCE ~ SPEED, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.820 -8.798 -2.272 5.614 44.951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.0481 7.4169 -2.973 0.00548 **
## SPEED 4.0457 0.4589 8.817 3.44e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.08 on 33 degrees of freedom
## Multiple R-squared: 0.702, Adjusted R-squared: 0.693
## F-statistic: 77.73 on 1 and 33 DF, p-value: 3.435e-10
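Based on these coefficients, the fitted relationship is approximately DISTANCE = -22.05 + 4.05 * SPEED, with an adjusted R-squared of about 0.69. As an illustrative sketch, the fitted line could be overlaid on the training data with the base-R functions coef() and abline():
# extract the intercept and slope, then overlay the fitted line on the training data
coef(model1)
plot(trainSet$SPEED, trainSet$DISTANCE, xlab = "Speed (mph)", ylab = "Stopping distance (ft)", main = "Fitted line on the training set")
abline(model1, col = "blue")   # draws the line from model1's coefficients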
Predictions on the test set were then generated with the predict() function.
distance_covered <- predict(model1, testSet)
The predictions were then viewed by printing the distance_covered object.
distance_covered
## 1 2 3 4 5 6 7
## -5.865260 -5.865260 6.271871 6.271871 10.317581 14.363291 18.409001
## 8 9 10 11 12 13 14
## 18.409001 18.409001 22.454712 22.454712 26.500422 26.500422 26.500422
## 15 16 17 18 19 20 21
## 26.500422 30.546132 30.546132 30.546132 30.546132 34.591842 34.591842
## 22 23 24 25 26 27 28
## 34.591842 34.591842 38.637553 38.637553 38.637553 42.683263 42.683263
## 29 30 31 32 33 34 35
## 46.728973 46.728973 46.728973 50.774684 50.774684 50.774684 50.774684
## 36 37 38 39 40 41 42
## 54.820394 54.820394 54.820394 58.866104 58.866104 58.866104 58.866104
## 43 44 45 46 47 48 49
## 58.866104 66.957525 71.003235 75.048945 75.048945 75.048945 75.048945
## 50
## 79.094655
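To quantify how close the predictions are to the observed test-set distances, standard error measures such as RMSE and MAE could be computed. This was not part of the original analysis; the sketch below uses the objects created above, with actual, rmse, and mae as illustrative names:
# compare the predictions with the observed distances in the test set
actual <- testSet$DISTANCE
rmse <- sqrt(mean((actual - distance_covered)^2))   # root mean squared error
mae  <- mean(abs(actual - distance_covered))        # mean absolute error
rmse
mae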