Execute the following cell to load the tidyverse library:
library(tidyverse)
Execute the following cell to load the data. See http://archive.ics.uci.edu/ml/datasets/Auto+MPG for details on the dataset:
autompg = read.table(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"",
  comment.char = "",
  stringsAsFactors = FALSE)
head(autompg, 20)
Task 1: print the structure of the unedited data set. How many samples and features are there?
str(autompg)
'data.frame': 390 obs. of 8 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cyl : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
$ disp : num 307 350 318 304 302 429 454 440 455 390 ...
$ hp : num 130 165 150 150 140 198 220 215 225 190 ...
$ wt : num 3504 3693 3436 3433 3449 ...
$ acc : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin: Factor w/ 2 levels "international",..: 2 2 2 2 2 2 2 2 2 2 ...
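# The raw data set has 398 samples and 9 columns (nine column names are assigned below): mpg, seven further attributes, and the car name.
# Note that the str() output shown above (390 obs. of 8 variables, with factor columns) reflects the data frame after the later cleaning steps, not the raw file.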
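The sample and column counts asked for in Task 1 can also be read off directly with `nrow()`, `ncol()` and `dim()`. The sketch below uses a small made-up data frame; `toy` and its values are hypothetical and not taken from the dataset.

```r
# Hypothetical three-row data frame standing in for the raw data.
toy <- data.frame(
  V1 = c(18, 15, 18),
  V2 = c(8L, 8L, 8L),
  V3 = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

n_samples <- nrow(toy)  # number of rows (samples)
n_columns <- ncol(toy)  # number of columns
dim(toy)                # both at once: rows first, then columns
```

Calling the same functions on `autompg` right after loading it reports the size of the unedited data.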
Execute the following cell to assign names to the columns of the dataframe:
colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
colnames(autompg)
[1] "mpg" "cyl" "disp" "hp" "wt" "acc" "year" "origin" "name"
Task 2: complete the code segment below to remove samples with missing horsepower (hp) values, which are represented as “?” in the dataset:
autompg = autompg %>% filter(hp != "?")
autompg
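As an alternative to filtering out the “?” strings after loading, `read.table()` can be told to treat them as missing values through its `na.strings` argument, and the column names can be supplied through `col.names`. The sketch below is one possible approach, not the method this notebook uses; `raw_text` holds a few illustrative rows in the file's layout so that the example runs without network access.

```r
# Illustrative rows in the same whitespace-separated layout as the data file.
raw_text <- '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
25.0 4 98.00 ? 2046. 19.0 71 1 "ford pinto"
24.0 4 113.0 95.00 2372. 15.0 70 3 "toyota corona mark ii"'

toy <- read.table(
  text = raw_text,
  quote = "\"",
  comment.char = "",
  na.strings = "?",   # read "?" as NA instead of as a character string
  col.names = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name"),
  stringsAsFactors = FALSE)

# hp is now numeric, so missing rows can be dropped with is.na().
toy_clean <- toy[!is.na(toy$hp), ]
```

With this approach the hp column is read as numeric from the start, so no later type conversion is needed.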
Task 3: complete the code segment below to remove samples with the name “plymouth reliant”:
autompg = autompg %>% filter(name != "plymouth reliant")
autompg
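It can be useful to confirm how many rows a filter actually removes. The following is a minimal sketch on a made-up data frame (`toy`, `n_before` and `n_removed` are hypothetical names, and the values are illustrative):

```r
library(dplyr)

# Hypothetical data with one row that matches the name to remove.
toy <- data.frame(
  hp   = c(84, 88, 92),
  name = c("plymouth reliant", "ford pinto", "dodge aries se"),
  stringsAsFactors = FALSE
)

n_before  <- nrow(toy)
toy_kept  <- toy %>% filter(name != "plymouth reliant")
n_removed <- n_before - nrow(toy_kept)  # rows dropped by the filter
```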
Task 4: complete the code segment below to select all features except ‘name’:
autompg = autompg %>% select(-name)
autompg
Execute the following cell to change the type of hp values from character to numeric:
autompg$hp = as.numeric(autompg$hp)
autompg$hp
[1] 130 165 150 150 140 198 220 215 225 190 170 160 150 225 95 95 97 85 88 46 87 90 95 113 90 215 200 210
[29] 193 88 90 95 100 105 100 88 100 165 175 153 150 180 170 175 110 72 100 88 86 90 70 76 65 69 60 70
[57] 95 80 54 90 86 165 175 150 153 150 208 155 160 190 97 150 130 140 150 112 76 87 69 86 92 97 80 88
[85] 175 150 145 137 150 198 150 158 150 215 225 175 105 100 100 88 95 46 150 167 170 180 100 88 72 94 90 85
[113] 107 90 145 230 49 75 91 112 150 110 122 180 95 100 100 67 80 65 75 100 110 105 140 150 150 140 150 83
[141] 67 78 52 61 75 75 75 97 93 67 95 105 72 72 170 145 150 148 110 105 110 95 110 110 129 75 83 100
[169] 78 96 71 97 97 70 90 95 88 98 115 53 86 81 92 79 83 140 150 120 152 100 105 81 90 52 60 70
[197] 53 100 78 110 95 71 70 75 72 102 150 88 108 120 180 145 130 150 68 80 58 96 70 145 110 145 130 110
[225] 105 100 98 180 170 190 149 78 88 75 89 63 83 67 78 97 110 110 48 66 52 70 60 110 140 139 105 95
[253] 85 88 100 90 105 85 110 120 145 165 139 140 68 95 97 75 95 105 85 97 103 125 115 133 71 68 115 85
[281] 88 90 110 130 129 138 135 155 142 125 150 71 65 80 80 77 125 71 90 70 70 65 69 90 115 115 90 76
[309] 60 70 65 90 88 90 90 78 90 75 92 75 65 105 65 48 48 67 67 67 67 62 132 100 88 72 84 92
[337] 110 58 64 60 67 65 62 68 63 65 65 74 75 75 100 74 80 76 116 120 110 105 88 85 88 88 88 85
[365] 84 90 92 74 68 68 63 70 88 75 70 67 67 67 110 85 92 112 96 84 90 86 52 84 79 82
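The conversion above is clean because the “?” rows were removed first. As a small illustration with made-up values, `as.numeric()` turns any string it cannot parse into `NA` and emits a warning, which is suppressed here only to keep the sketch quiet:

```r
hp_raw <- c("130.0", "?", "95.00")

# "?" cannot be parsed as a number and becomes NA.
hp_num <- suppressWarnings(as.numeric(hp_raw))

n_missing <- sum(is.na(hp_num))  # count of values lost in the conversion
```

Checking `sum(is.na(autompg$hp))` after the conversion is a quick way to confirm that no such values slipped through.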
Execute the following code cell to recode the ‘origin’ column so that models with origin code 1 are labelled ‘local’ and models with origin code 2 or 3 are labelled ‘international’:
autompg = autompg %>% mutate(origin = ifelse(!(origin %in% c(2, 3)), 'local', 'international'))
head(autompg, 20)
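The recoding rule can be checked on a short vector before applying it to the whole column. In this sketch `origin_code` is a hypothetical vector of made-up codes; `table()` then counts how many values received each label:

```r
origin_code <- c(1, 2, 3, 1)

# Codes 2 and 3 become "international"; everything else becomes "local".
origin_label <- ifelse(!(origin_code %in% c(2, 3)), "local", "international")

label_counts <- table(origin_label)
```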
Task 5: print the structure of the dataframe. What types are the columns ‘cyl’ and ‘origin’?
str(autompg)
'data.frame': 390 obs. of 8 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cyl : int 8 8 8 8 8 8 8 8 8 8 ...
$ disp : num 307 350 318 304 302 429 454 440 455 390 ...
$ hp : num 130 165 150 150 140 198 220 215 225 190 ...
$ wt : num 3504 3693 3436 3433 3449 ...
$ acc : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin: chr "local" "local" "local" "local" ...
# Column 'cyl' is of type integer, while 'origin' is of type character.
Task 6: complete the code segment below to change the types of the ‘cyl’ and ‘origin’ columns to factor:
catcols = c('cyl', 'origin')
autompg[catcols] = lapply(autompg[catcols], as.factor)
autompg[catcols]
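One thing to keep in mind after converting a numeric column such as ‘cyl’ to a factor is that the factor stores integer level codes, not the original numbers. A minimal sketch with made-up cylinder counts:

```r
cyl <- as.factor(c(8L, 4L, 6L, 4L))

cyl_levels <- levels(cyl)                      # level labels, sorted: "4" "6" "8"
cyl_codes  <- as.integer(cyl)                  # internal level codes, not cylinder counts
cyl_values <- as.integer(as.character(cyl))    # recovers the original values
```

Going through `as.character()` first is the usual way to get the original numbers back from such a factor.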
Task 7: complete the code segment below to create a scatter plot of mpg vs. displacement, color-coding the points by origin (local or international). Comment on what you observe:
p = ggplot(data = autompg, aes(x = mpg, y = disp, color = origin)) +
  geom_point() +
  labs(x = 'mpg', y = 'Displacement', title = 'mpg vs. Displacement')
# The scatter plot shows mpg against displacement. Points with the same origin (local or international) share a color.
p
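A per-group numeric summary can complement the visual comparison of the two origins. The sketch below uses a made-up data frame with the same column names; the values are illustrative only and say nothing about the real data.

```r
library(dplyr)

# Hypothetical values, chosen only to make the example runnable.
toy <- data.frame(
  mpg    = c(18, 15, 24, 27),
  disp   = c(307, 350, 113, 97),
  origin = factor(c("local", "local", "international", "international"))
)

# Mean mpg, mean displacement, and row count for each origin.
by_origin <- toy %>%
  group_by(origin) %>%
  summarise(mean_mpg = mean(mpg), mean_disp = mean(disp), n = n())
```

Running the same pipeline on `autompg` gives one row per origin, which can be read alongside the plot.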