R Bridge Course Week 2 Assignment
This assignment is about data wrangling (data munging) in R. I have attempted every question in the assignment, and I am enjoying learning R as I work through them.
The assignment is built around a data set chosen from the list in dataset.csv at http://vincentarelbundock.github.io/Rdatasets/ I chose the carprice data set from the GitHub link: https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv
First step: get the data set from the GitHub repository and read it into R. I used carprice.csv for this week’s assignment. This data set gives the type of car, several price attributes and the miles per gallon of various car models.
## The first part is to select a data set from the given list and read it into RStudio.
## There are two ways to read the data set into R:
## 1. Download the csv file from GitHub, place it in the working directory, and read it with the read.csv function.
## Please note that this first way is commented out with a single hash, because it only works if carprice.csv is present in the reader's own working directory.
#getwd()
#carprice.data <- read.csv("carprice.csv", header = TRUE)
#View(carprice.data)
## 2. Read the .csv file directly from GitHub using the URL of the csv file.
theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv"
carprice.data <- read.csv(theURL, header = TRUE, stringsAsFactors = TRUE)  # keeps Type a factor on R >= 4.0.0 as well
carprice.data
## X Type Min.Price Price Max.Price Range.Price RoughRange gpm100
## 1 6 Midsize 14.2 15.7 17.3 3.1 3.09 3.8
## 2 7 Large 19.9 20.8 21.7 1.8 1.79 4.2
## 3 8 Large 22.6 23.7 24.9 2.3 2.31 4.9
## 4 9 Midsize 26.3 26.3 26.3 0.0 -0.01 4.3
## 5 10 Large 33.0 34.7 36.3 3.3 3.30 4.9
## 6 11 Midsize 37.5 40.1 42.7 5.2 5.18 4.9
## 7 12 Compact 8.5 13.4 18.3 9.8 9.80 3.3
## 8 13 Compact 11.4 11.4 11.4 0.0 -0.01 3.4
## 9 14 Sporty 13.4 15.1 16.8 3.4 3.38 4.2
## 10 15 Midsize 13.4 15.9 18.4 5.0 5.01 4.0
## 11 16 Van 14.7 16.3 18.0 3.3 3.31 4.9
## 12 17 Van 14.7 16.6 18.6 3.9 3.90 5.7
## 13 18 Large 18.0 18.8 19.6 1.6 1.60 4.7
## 14 19 Sporty 34.6 38.0 41.5 6.9 6.88 4.8
## 15 20 Large 18.4 18.4 18.4 0.0 -0.01 4.2
## 16 21 Compact 14.5 15.8 17.1 2.6 2.59 3.9
## 17 22 Large 29.5 29.5 29.5 0.0 0.02 4.3
## 18 23 Small 7.9 9.2 10.6 2.7 2.68 3.2
## 19 24 Small 8.4 11.3 14.2 5.8 5.80 3.8
## 20 25 Compact 11.9 13.3 14.7 2.8 2.81 4.1
## 21 26 Van 13.6 19.0 24.4 10.8 10.77 5.3
## 22 27 Midsize 14.8 15.6 16.4 1.6 1.60 4.2
## 23 28 Sporty 18.5 25.8 33.1 14.6 14.60 4.8
## 24 29 Small 7.9 12.2 16.5 8.6 8.60 3.2
## 25 30 Large 17.5 19.3 21.2 3.7 3.69 4.2
## 26 31 Small 6.9 7.4 7.9 1.0 1.00 3.1
## 27 32 Small 8.4 10.1 11.9 3.5 3.49 3.8
## 28 33 Compact 10.4 11.3 12.2 1.8 1.82 4.1
## 29 34 Sporty 10.8 15.9 21.0 10.2 10.21 3.9
## 30 35 Sporty 12.8 14.0 15.2 2.4 2.40 3.7
## 31 36 Van 14.5 19.9 25.3 10.8 10.82 5.7
## 32 37 Midsize 15.6 20.2 24.8 9.2 9.21 3.9
## 33 38 Large 20.1 20.9 21.7 1.6 1.59 4.5
## 34 51 Midsize 33.3 34.3 35.3 2.0 1.99 4.7
## 35 52 Large 34.4 36.1 37.8 3.4 3.42 4.5
## 36 60 Sporty 13.3 14.1 15.0 1.7 1.71 4.1
## 37 61 Midsize 14.9 14.9 14.9 0.0 -0.02 4.4
## 38 68 Compact 13.0 13.5 14.0 1.0 0.99 3.6
## 39 69 Midsize 14.2 16.3 18.4 4.2 4.19 3.7
## 40 70 Van 19.5 19.5 19.5 0.0 0.00 4.9
## 41 71 Large 19.5 20.7 21.9 2.4 2.41 4.2
## 42 72 Sporty 11.4 14.4 17.4 6.0 6.01 3.8
## 43 73 Small 8.2 9.0 9.9 1.7 1.69 2.8
## 44 74 Compact 9.4 11.1 12.8 3.4 3.39 3.7
## 45 75 Sporty 14.0 17.7 21.4 7.4 7.40 4.2
## 46 76 Midsize 15.4 18.5 21.6 6.2 6.19 4.3
## 47 77 Large 19.4 24.4 29.4 10.0 10.00 4.2
## 48 79 Small 9.2 11.1 12.9 3.7 3.70 3.0
## MPG.city MPG.highway
## 1 22 31
## 2 19 28
## 3 16 25
## 4 19 27
## 5 16 25
## 6 16 25
## 7 25 36
## 8 25 34
## 9 19 28
## 10 21 29
## 11 18 23
## 12 15 20
## 13 17 26
## 14 17 25
## 15 20 28
## 16 23 28
## 17 20 26
## 18 29 33
## 19 23 29
## 20 22 27
## 21 17 21
## 22 21 27
## 23 18 24
## 24 29 33
## 25 20 28
## 26 31 33
## 27 23 30
## 28 22 27
## 29 22 29
## 30 24 30
## 31 15 20
## 32 21 30
## 33 18 26
## 34 17 26
## 35 18 26
## 36 23 26
## 37 19 26
## 38 24 31
## 39 23 31
## 40 18 23
## 41 19 28
## 42 23 30
## 43 31 41
## 44 23 31
## 45 19 28
## 46 19 27
## 47 19 28
## 48 28 38
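The two ways of loading the file described above can be combined in one small helper. The sketch below is only an illustration: read_carprice is a hypothetical name that is not part of the assignment, and a temporary file holding two rows copied from the listing stands in for a downloaded carprice.csv so that the example runs without a network connection.

```r
# Hypothetical helper (not part of the assignment): use the local csv when it
# exists at the given path, otherwise fall back to the URL.
read_carprice <- function(local_path, url) {
  src <- if (file.exists(local_path)) local_path else url
  # stringsAsFactors = TRUE keeps text columns such as Type as factors
  read.csv(src, header = TRUE, stringsAsFactors = TRUE)
}

theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv"

# A temporary file with two rows copied from the listing above stands in for
# a downloaded carprice.csv, so no network access is needed here.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X,Type,Price", "6,Midsize,15.7", "7,Large,20.8"), tmp)

demo <- read_carprice(tmp, theURL)
str(demo)
```

When the local file is missing, the same call reads from theURL instead, so the script does not depend on the reader's working directory.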
Now that the carprice.csv data set has been read into the data.frame carprice.data, we can perform the tasks given in the questions below.
## [1] "The summary of carprice.csv data set is as below"
## X Type Min.Price Price
## Min. : 6.00 Compact: 7 Min. : 6.90 Min. : 7.40
## 1st Qu.:17.75 Large :11 1st Qu.:11.40 1st Qu.:13.47
## Median :29.50 Midsize:10 Median :14.50 Median :16.30
## Mean :36.54 Small : 7 Mean :16.54 Mean :18.57
## 3rd Qu.:60.25 Sporty : 8 3rd Qu.:19.43 3rd Qu.:20.73
## Max. :79.00 Van : 5 Max. :37.50 Max. :40.10
## Max.Price Range.Price RoughRange gpm100
## Min. : 7.90 Min. : 0.000 Min. :-0.020 Min. :2.800
## 1st Qu.:14.97 1st Qu.: 1.700 1st Qu.: 1.705 1st Qu.:3.800
## Median :18.40 Median : 3.300 Median : 3.305 Median :4.200
## Mean :20.63 Mean : 4.092 Mean : 4.089 Mean :4.167
## 3rd Qu.:24.50 3rd Qu.: 5.850 3rd Qu.: 5.853 3rd Qu.:4.550
## Max. :42.70 Max. :14.600 Max. :14.600 Max. :5.700
## MPG.city MPG.highway
## Min. :15.00 Min. :20.00
## 1st Qu.:18.00 1st Qu.:26.00
## Median :20.00 Median :28.00
## Mean :20.96 Mean :28.15
## 3rd Qu.:23.00 3rd Qu.:30.00
## Max. :31.00 Max. :41.00
## [1] "The names of columns in the data set are as below:"
## [1] "X" "Type" "Min.Price" "Price" "Max.Price"
## [6] "Range.Price" "RoughRange" "gpm100" "MPG.city" "MPG.highway"
## [1] "mean of price is : 18.57"
## [1] "median of price is : 16.3"
## [1] "mean of miles per gallon in city (MPG.city) is : 20.96"
## [1] "median of miles per gallon in city (MPG.city) is : 20"
## [1] "mean of Range Price is : 4.092"
## [1] "median of Range Price is : 3.3"
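The code that produced the summary, column names, mean and median output above is not shown in this report. A minimal sketch of how such output can be produced follows. It uses only the first six rows of the listing, typed in by hand as cars6 (a made-up name), so that it runs on its own; its numbers therefore differ from the full-data values printed above.

```r
# First six rows of the listing above, three columns only (illustrative subset)
cars6 <- data.frame(
  Type        = factor(c("Midsize", "Large", "Large", "Midsize", "Large", "Midsize")),
  Price       = c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  Range.Price = c(3.1, 1.8, 2.3, 0.0, 3.3, 5.2)
)

print("The summary of the data set is as below")
summary(cars6)

print("The names of columns in the data set are as below:")
names(cars6)

# Mean and median of one attribute, formatted like the output above
price_mean   <- mean(cars6$Price)
price_median <- median(cars6$Price)
print(paste("mean of price is :", round(price_mean, 2)))
print(paste("median of price is :", price_median))
```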
## Here the subset data.frame will be created with rows 2, 5, 7, 10, 11, 12, 18, 19 and 20; and first 7 columns from the original data.frame
carprice.subset <- carprice.data[c(2,5,7,10:12,18:20), c(1:7)]
names(carprice.subset)
## [1] "X" "Type" "Min.Price" "Price" "Max.Price"
## [6] "Range.Price" "RoughRange"
## Now we rename the columns of the new subset data.frame, carprice.subset, so that each name has subset_ as a prefix to the corresponding column name of the original data.frame
names(carprice.subset) <- paste("subset_", names(carprice.subset))
## names of columns after renaming
names(carprice.subset)
## [1] "subset_ X" "subset_ Type" "subset_ Min.Price"
## [4] "subset_ Price" "subset_ Max.Price" "subset_ Range.Price"
## [7] "subset_ RoughRange"
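The space after the underscore in the names above comes from paste, which joins its arguments with a single space by default. If names without the space are wanted, paste0 or sep = "" can be used instead. The short sketch below compares the three calls on a made-up vector of column names (cols); the rest of this report keeps the names exactly as printed above.

```r
cols <- c("X", "Type", "Min.Price")   # illustrative column names

with_space <- paste("subset_", cols)            # default sep = " "
no_space   <- paste0("subset_", cols)           # no separator
same_thing <- paste("subset_", cols, sep = "")  # equivalent to paste0 here

with_space
no_space
```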
carprice.subset
## subset_ X subset_ Type subset_ Min.Price subset_ Price
## 2 7 Large 19.9 20.8
## 5 10 Large 33.0 34.7
## 7 12 Compact 8.5 13.4
## 10 15 Midsize 13.4 15.9
## 11 16 Van 14.7 16.3
## 12 17 Van 14.7 16.6
## 18 23 Small 7.9 9.2
## 19 24 Small 8.4 11.3
## 20 25 Compact 11.9 13.3
## subset_ Max.Price subset_ Range.Price subset_ RoughRange
## 2 21.7 1.8 1.79
## 5 36.3 3.3 3.30
## 7 18.3 9.8 9.80
## 10 18.4 5.0 5.01
## 11 18.0 3.3 3.31
## 12 18.6 3.9 3.90
## 18 10.6 2.7 2.68
## 19 14.2 5.8 5.80
## 20 14.7 2.8 2.81
## [1] "Summary of the subset data.frame : "
## subset_ X subset_ Type subset_ Min.Price subset_ Price
## Min. : 7.00 Compact:2 Min. : 7.90 Min. : 9.20
## 1st Qu.:12.00 Large :2 1st Qu.: 8.50 1st Qu.:13.30
## Median :16.00 Midsize:1 Median :13.40 Median :15.90
## Mean :16.56 Small :2 Mean :14.71 Mean :16.83
## 3rd Qu.:23.00 Sporty :0 3rd Qu.:14.70 3rd Qu.:16.60
## Max. :25.00 Van :2 Max. :33.00 Max. :34.70
## subset_ Max.Price subset_ Range.Price subset_ RoughRange
## Min. :10.60 Min. :1.800 Min. :1.790
## 1st Qu.:14.70 1st Qu.:2.800 1st Qu.:2.810
## Median :18.30 Median :3.300 Median :3.310
## Mean :18.98 Mean :4.267 Mean :4.267
## 3rd Qu.:18.60 3rd Qu.:5.000 3rd Qu.:5.010
## Max. :36.30 Max. :9.800 Max. :9.800
## [1] "mean of the price attribute of the subset data.frame is : 16.83"
## [1] "median of the price attribute of the subset data.frame is: 15.9"
## [1] "mean of the Range Price attribute of the subset data.frame is : 4.267"
## [1] "median of the Range Price attribute of the subset data.frame is : 3.3"
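The code behind the subset summary above is hidden as well. The sketch below rebuilds two columns of the subset by hand from the printed rows (sub9 is a made-up name) and shows two details: column names that contain a space have to be accessed with [[ ]] or backticks, and droplevels can be used if the empty Sporty level should not appear in the summary. This is an assumption about how it could be done, not the code used for the report.

```r
type_levels <- c("Compact", "Large", "Midsize", "Small", "Sporty", "Van")

# Two columns of the subset, typed in from the printed rows above.
# check.names = FALSE keeps the space in the column names.
sub9 <- data.frame(
  "subset_ Type"  = factor(c("Large", "Large", "Compact", "Midsize", "Van",
                             "Van", "Small", "Small", "Compact"),
                           levels = type_levels),
  "subset_ Price" = c(20.8, 34.7, 13.4, 15.9, 16.3, 16.6, 9.2, 11.3, 13.3),
  check.names = FALSE
)

summary(sub9)                                # Sporty is listed with count 0
summary(droplevels(sub9[["subset_ Type"]]))  # unused level removed

sub_price_mean   <- mean(sub9[["subset_ Price"]])
sub_price_median <- median(sub9$`subset_ Price`)
print(paste("mean of the price attribute of the subset data.frame is :",
            round(sub_price_mean, 2)))
print(paste("median of the price attribute of the subset data.frame is:",
            sub_price_median))
```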
## [1] "carprice.data_Price_Mean : 18.57"
## [1] "carprice.subset_Price_Mean : 16.83"
## [1] "carprice.data Price mean is greater than carprice.subset Price mean"
## [1] "carprice.data_Range.Price_Mean : 4.092"
## [1] "carprice.subset_Range.Price_Mean : 4.267"
## [1] "carprice.data Range.Price mean is less than carprice.subset Range.Price mean"
## [1] "carprice.data_Price_Median : 16.3"
## [1] "carprice.subset_Price_Median : 15.9"
## [1] "carprice.data Price median is greater than carprice.subset Price median"
## [1] "carprice.data_Range.Price_Median : 3.3"
## [1] "carprice.subset_Range.Price_Median : 3.3"
## [1] "carprice.data Range.Price median is equal to carprice.subset Range.Price median"
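The comparison messages above follow a fixed pattern, so they can be generated by one small function. The sketch below is an assumption about how this might be written (compare_stat is a made-up name). It takes the values as printed above and uses all.equal for the equality test so that tiny floating-point differences do not matter.

```r
# Hypothetical helper: describe how a statistic of the full data.frame
# relates to the same statistic of the subset.
compare_stat <- function(full_value, subset_value, label) {
  relation <- if (isTRUE(all.equal(full_value, subset_value))) {
    "equal to"
  } else if (full_value > subset_value) {
    "greater than"
  } else {
    "less than"
  }
  paste("carprice.data", label, "is", relation, "carprice.subset", label)
}

# Values taken from the output above
msg_price_mean   <- compare_stat(18.57, 16.83, "Price mean")
msg_range_mean   <- compare_stat(4.092, 4.267, "Range.Price mean")
msg_range_median <- compare_stat(3.3, 3.3, "Range.Price median")
print(msg_price_mean)
print(msg_range_mean)
print(msg_range_median)
```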
## The column "Type" is a factor, because read.csv converts character columns to factors by default in R versions before 4.0.0 (newer versions need stringsAsFactors = TRUE). We therefore update the labels of the 3 levels named in the question, so that these 3 levels carry the new values.
## This can be done in several ways. Two ways are given below:
## Way-1 - use the levels function
levels(carprice.data$Type)[levels(carprice.data$Type) == 'Midsize'] <- "Mid-size"
levels(carprice.data$Type)[levels(carprice.data$Type) == "Compact"] <- "Super-small"
levels(carprice.data$Type)[levels(carprice.data$Type) == "Sporty"] <- "Sportz"
levels(carprice.data$Type)
## [1] "Super-small" "Large" "Mid-size" "Small" "Sportz"
## [6] "Van"
## Way-2 - Factor levels can also be renamed with the revalue function from the plyr package. Before this, we reload the data.frame from GitHub so that the original values are restored before we change them again with Way-2
carprice.data <- read.csv(theURL, header = TRUE, stringsAsFactors = TRUE)  # keeps Type a factor on R >= 4.0.0 as well
levels(carprice.data$Type)
## [1] "Compact" "Large" "Midsize" "Small" "Sporty" "Van"
library(plyr)
carprice.data$Type <- revalue(carprice.data$Type, c("Midsize" = "Mid-size", "Compact" = "Super-small", "Sporty" = "Sportz"))
levels(carprice.data$Type)
## [1] "Super-small" "Large" "Mid-size" "Small" "Sportz"
## [6] "Van"
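Both ways above give the same levels. As a third, purely illustrative variant, the old-to-new mapping can be kept in a named vector and applied in base R without plyr. The sketch uses a small made-up factor (type) instead of carprice.data$Type so that it runs on its own; new_names is likewise a name chosen for this example.

```r
# Small stand-in for carprice.data$Type
type <- factor(c("Midsize", "Large", "Compact", "Sporty", "Van", "Small"))

# Old level (name) -> new level (value)
new_names <- c(Midsize = "Mid-size", Compact = "Super-small", Sporty = "Sportz")

lv  <- levels(type)
hit <- lv %in% names(new_names)            # which levels have a replacement
lv[hit] <- unname(new_names[lv[hit]])      # look up the new label by old name
levels(type) <- lv

levels(type)
```

Keeping the mapping in one vector means a further rename only needs one more entry in new_names.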
## [1] "First 20 rows of the original data.frame"
## X Type Min.Price Price Max.Price Range.Price RoughRange gpm100
## 1 6 Mid-size 14.2 15.7 17.3 3.1 3.09 3.8
## 2 7 Large 19.9 20.8 21.7 1.8 1.79 4.2
## 3 8 Large 22.6 23.7 24.9 2.3 2.31 4.9
## 4 9 Mid-size 26.3 26.3 26.3 0.0 -0.01 4.3
## 5 10 Large 33.0 34.7 36.3 3.3 3.30 4.9
## 6 11 Mid-size 37.5 40.1 42.7 5.2 5.18 4.9
## 7 12 Super-small 8.5 13.4 18.3 9.8 9.80 3.3
## 8 13 Super-small 11.4 11.4 11.4 0.0 -0.01 3.4
## 9 14 Sportz 13.4 15.1 16.8 3.4 3.38 4.2
## 10 15 Mid-size 13.4 15.9 18.4 5.0 5.01 4.0
## 11 16 Van 14.7 16.3 18.0 3.3 3.31 4.9
## 12 17 Van 14.7 16.6 18.6 3.9 3.90 5.7
## 13 18 Large 18.0 18.8 19.6 1.6 1.60 4.7
## 14 19 Sportz 34.6 38.0 41.5 6.9 6.88 4.8
## 15 20 Large 18.4 18.4 18.4 0.0 -0.01 4.2
## 16 21 Super-small 14.5 15.8 17.1 2.6 2.59 3.9
## 17 22 Large 29.5 29.5 29.5 0.0 0.02 4.3
## 18 23 Small 7.9 9.2 10.6 2.7 2.68 3.2
## 19 24 Small 8.4 11.3 14.2 5.8 5.80 3.8
## 20 25 Super-small 11.9 13.3 14.7 2.8 2.81 4.1
## MPG.city MPG.highway
## 1 22 31
## 2 19 28
## 3 16 25
## 4 19 27
## 5 16 25
## 6 16 25
## 7 25 36
## 8 25 34
## 9 19 28
## 10 21 29
## 11 18 23
## 12 15 20
## 13 17 26
## 14 17 25
## 15 20 28
## 16 23 28
## 17 20 26
## 18 29 33
## 19 23 29
## 20 22 27
## [1] "First 5 rows of the subset data.frame"
## subset_ X subset_ Type subset_ Min.Price subset_ Price
## 2 7 Large 19.9 20.8
## 5 10 Large 33.0 34.7
## 7 12 Compact 8.5 13.4
## 10 15 Midsize 13.4 15.9
## 11 16 Van 14.7 16.3
## subset_ Max.Price subset_ Range.Price subset_ RoughRange
## 2 21.7 1.8 1.79
## 5 36.3 3.3 3.30
## 7 18.3 9.8 9.80
## 10 18.4 5.0 5.01
## 11 18.0 3.3 3.31
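The code for the two listings above is hidden in the report. One common way to print the first rows of a data.frame is head; the sketch below assumes that approach and uses a small made-up data.frame (demo) in place of carprice.data and carprice.subset.

```r
# Made-up data.frame with 30 rows, standing in for the real data
demo <- data.frame(id = 1:30, value = seq(0.5, 15, by = 0.5))

print("First 20 rows of the data.frame")
first20 <- head(demo, 20)
first20

print("First 5 rows of the data.frame")
first5 <- head(demo, n = 5)
first5
```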
### This has already been handled as part of the initial load of carprice.data, which was read directly from the GitHub file link