R Bridge Course Week 2 Assignment

This assignment is about data wrangling (data munging). I have done my best to answer each question given in this assignment. As I learn R, I am enjoying solving these assignment questions and learning as I go.

This assignment is built around a data set that can be downloaded from the links listed in dataset.csv at http://vincentarelbundock.github.io/Rdatasets/. I have chosen the carprice data set, available from GitHub at: https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv

First step: fetch the data set from the GitHub repository and read it into R. I have used the data set carprice.csv for this week’s assignment. It gives the type of car, several price attributes, and the fuel economy (miles per gallon) of various car models.

## The first part is to select a data set from the given list and read it into RStudio.
## There are 2 ways to read the data set into R.
## 1. Download the csv file from GitHub, place it in the working directory, and then read it using the read.csv function.
## Note that I have commented out this first way with a single hash: it will not work on another user's system unless the same csv file happens to be present in their working directory.

#getwd()

#carprice.data <- read.csv("carprice.csv", header = TRUE)
#View(carprice.data)

## 2. Read the .csv file directly from GitHub using its URL.

theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/carprice.csv"

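# stringsAsFactors = TRUE keeps Type a factor on R 4.0.0 and later,
# where read.csv no longer converts character columns by default.
carprice.data <- read.csv(theURL, header = TRUE, stringsAsFactors = TRUE)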

carprice.data
##     X    Type Min.Price Price Max.Price Range.Price RoughRange gpm100
## 1   6 Midsize      14.2  15.7      17.3         3.1       3.09    3.8
## 2   7   Large      19.9  20.8      21.7         1.8       1.79    4.2
## 3   8   Large      22.6  23.7      24.9         2.3       2.31    4.9
## 4   9 Midsize      26.3  26.3      26.3         0.0      -0.01    4.3
## 5  10   Large      33.0  34.7      36.3         3.3       3.30    4.9
## 6  11 Midsize      37.5  40.1      42.7         5.2       5.18    4.9
## 7  12 Compact       8.5  13.4      18.3         9.8       9.80    3.3
## 8  13 Compact      11.4  11.4      11.4         0.0      -0.01    3.4
## 9  14  Sporty      13.4  15.1      16.8         3.4       3.38    4.2
## 10 15 Midsize      13.4  15.9      18.4         5.0       5.01    4.0
## 11 16     Van      14.7  16.3      18.0         3.3       3.31    4.9
## 12 17     Van      14.7  16.6      18.6         3.9       3.90    5.7
## 13 18   Large      18.0  18.8      19.6         1.6       1.60    4.7
## 14 19  Sporty      34.6  38.0      41.5         6.9       6.88    4.8
## 15 20   Large      18.4  18.4      18.4         0.0      -0.01    4.2
## 16 21 Compact      14.5  15.8      17.1         2.6       2.59    3.9
## 17 22   Large      29.5  29.5      29.5         0.0       0.02    4.3
## 18 23   Small       7.9   9.2      10.6         2.7       2.68    3.2
## 19 24   Small       8.4  11.3      14.2         5.8       5.80    3.8
## 20 25 Compact      11.9  13.3      14.7         2.8       2.81    4.1
## 21 26     Van      13.6  19.0      24.4        10.8      10.77    5.3
## 22 27 Midsize      14.8  15.6      16.4         1.6       1.60    4.2
## 23 28  Sporty      18.5  25.8      33.1        14.6      14.60    4.8
## 24 29   Small       7.9  12.2      16.5         8.6       8.60    3.2
## 25 30   Large      17.5  19.3      21.2         3.7       3.69    4.2
## 26 31   Small       6.9   7.4       7.9         1.0       1.00    3.1
## 27 32   Small       8.4  10.1      11.9         3.5       3.49    3.8
## 28 33 Compact      10.4  11.3      12.2         1.8       1.82    4.1
## 29 34  Sporty      10.8  15.9      21.0        10.2      10.21    3.9
## 30 35  Sporty      12.8  14.0      15.2         2.4       2.40    3.7
## 31 36     Van      14.5  19.9      25.3        10.8      10.82    5.7
## 32 37 Midsize      15.6  20.2      24.8         9.2       9.21    3.9
## 33 38   Large      20.1  20.9      21.7         1.6       1.59    4.5
## 34 51 Midsize      33.3  34.3      35.3         2.0       1.99    4.7
## 35 52   Large      34.4  36.1      37.8         3.4       3.42    4.5
## 36 60  Sporty      13.3  14.1      15.0         1.7       1.71    4.1
## 37 61 Midsize      14.9  14.9      14.9         0.0      -0.02    4.4
## 38 68 Compact      13.0  13.5      14.0         1.0       0.99    3.6
## 39 69 Midsize      14.2  16.3      18.4         4.2       4.19    3.7
## 40 70     Van      19.5  19.5      19.5         0.0       0.00    4.9
## 41 71   Large      19.5  20.7      21.9         2.4       2.41    4.2
## 42 72  Sporty      11.4  14.4      17.4         6.0       6.01    3.8
## 43 73   Small       8.2   9.0       9.9         1.7       1.69    2.8
## 44 74 Compact       9.4  11.1      12.8         3.4       3.39    3.7
## 45 75  Sporty      14.0  17.7      21.4         7.4       7.40    4.2
## 46 76 Midsize      15.4  18.5      21.6         6.2       6.19    4.3
## 47 77   Large      19.4  24.4      29.4        10.0      10.00    4.2
## 48 79   Small       9.2  11.1      12.9         3.7       3.70    3.0
##    MPG.city MPG.highway
## 1        22          31
## 2        19          28
## 3        16          25
## 4        19          27
## 5        16          25
## 6        16          25
## 7        25          36
## 8        25          34
## 9        19          28
## 10       21          29
## 11       18          23
## 12       15          20
## 13       17          26
## 14       17          25
## 15       20          28
## 16       23          28
## 17       20          26
## 18       29          33
## 19       23          29
## 20       22          27
## 21       17          21
## 22       21          27
## 23       18          24
## 24       29          33
## 25       20          28
## 26       31          33
## 27       23          30
## 28       22          27
## 29       22          29
## 30       24          30
## 31       15          20
## 32       21          30
## 33       18          26
## 34       17          26
## 35       18          26
## 36       23          26
## 37       19          26
## 38       24          31
## 39       23          31
## 40       18          23
## 41       19          28
## 42       23          30
## 43       31          41
## 44       23          31
## 45       19          28
## 46       19          27
## 47       19          28
## 48       28          38
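Printing all 48 rows is hard to scan. As a minimal sketch, `dim()` and `str()` give a more compact check that the file was read as intended. The inline CSV text and the name `sample.data` are assumptions made for illustration: three rows copied from the output above stand in for the downloaded file so that the sketch runs offline.

```r
# Three rows copied from the printed output above; this inline text is a
# stand-in for the downloaded file so that the sketch runs offline.
csv_text <- "X,Type,Min.Price,Price,Max.Price,Range.Price,RoughRange,gpm100,MPG.city,MPG.highway
6,Midsize,14.2,15.7,17.3,3.1,3.09,3.8,22,31
7,Large,19.9,20.8,21.7,1.8,1.79,4.2,19,28
12,Compact,8.5,13.4,18.3,9.8,9.80,3.3,25,36"

# stringsAsFactors = TRUE makes Type a factor on any R version.
sample.data <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

dim(sample.data)   # number of rows and number of columns
str(sample.data)   # column names, column types and the first few values
```

The same two calls can be applied to carprice.data once it has been read from the URL.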

Now that we have read the carprice.csv data set into the data.frame carprice.data, we will perform the tasks given in the questions below.

  1. Using the summary function and the mean and median functions.
## [1] "The summary of carprice.csv data set is as below"
##        X              Type      Min.Price         Price      
##  Min.   : 6.00   Compact: 7   Min.   : 6.90   Min.   : 7.40  
##  1st Qu.:17.75   Large  :11   1st Qu.:11.40   1st Qu.:13.47  
##  Median :29.50   Midsize:10   Median :14.50   Median :16.30  
##  Mean   :36.54   Small  : 7   Mean   :16.54   Mean   :18.57  
##  3rd Qu.:60.25   Sporty : 8   3rd Qu.:19.43   3rd Qu.:20.73  
##  Max.   :79.00   Van    : 5   Max.   :37.50   Max.   :40.10  
##    Max.Price      Range.Price       RoughRange         gpm100     
##  Min.   : 7.90   Min.   : 0.000   Min.   :-0.020   Min.   :2.800  
##  1st Qu.:14.97   1st Qu.: 1.700   1st Qu.: 1.705   1st Qu.:3.800  
##  Median :18.40   Median : 3.300   Median : 3.305   Median :4.200  
##  Mean   :20.63   Mean   : 4.092   Mean   : 4.089   Mean   :4.167  
##  3rd Qu.:24.50   3rd Qu.: 5.850   3rd Qu.: 5.853   3rd Qu.:4.550  
##  Max.   :42.70   Max.   :14.600   Max.   :14.600   Max.   :5.700  
##     MPG.city      MPG.highway   
##  Min.   :15.00   Min.   :20.00  
##  1st Qu.:18.00   1st Qu.:26.00  
##  Median :20.00   Median :28.00  
##  Mean   :20.96   Mean   :28.15  
##  3rd Qu.:23.00   3rd Qu.:30.00  
##  Max.   :31.00   Max.   :41.00
## [1] "The names of columns in the data set are as below:"
##  [1] "X"           "Type"        "Min.Price"   "Price"       "Max.Price"  
##  [6] "Range.Price" "RoughRange"  "gpm100"      "MPG.city"    "MPG.highway"
## [1] "mean of price is :  18.57"
## [1] "median of price is :  16.3"
## [1] "mean of miles per gallon in city (MPG.city) is :  20.96"
## [1] "median of miles per gallon in city (MPG.city) is :  20"
## [1] "mean of Range Price is :  4.092"
## [1] "median of Range Price is :  3.3"
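The code that produced the output above is not shown in this report. The following is a minimal sketch of how such lines could be generated, not the exact code that was run. It uses a six-row excerpt (the first six rows of the printed data set) under the assumed name `sample.data`, and the use of `signif()` for rounding is also an assumption.

```r
# Price and MPG.city for the first six rows of the printed data set.
sample.data <- data.frame(
  Price    = c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  MPG.city = c(22, 19, 16, 19, 16, 16)
)

print("The summary of the sample is as below")
print(summary(sample.data))

print("The names of columns in the sample are as below:")
print(names(sample.data))

# mean() and median() work on a single numeric column.
price_mean   <- mean(sample.data$Price)
price_median <- median(sample.data$Price)

# signif(x, 4) keeps four significant digits (an assumed rounding choice).
print(paste("mean of price is : ", signif(price_mean, 4)))
print(paste("median of price is : ", signif(price_median, 4)))
```

On the full data set the same calls would be made against carprice.data$Price, carprice.data$MPG.city and carprice.data$Range.Price.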
  2. Creating a subset data.frame from the original data.frame - carprice.data
## Here the subset data.frame will be created with rows 2, 5, 7, 10, 11, 12, 18, 19 and 20, and the first 7 columns, of the original data.frame

carprice.subset <- carprice.data[c(2,5,7,10:12,18:20), c(1:7)]
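Positional column indices break silently if the column order changes. A minimal sketch of the same kind of subsetting with columns selected by name follows; the five-row `sample.data` excerpt and the names `sample.subset` and `price.only` are assumptions made for illustration.

```r
# First five rows of four columns from the printed data set.
sample.data <- data.frame(
  X      = c(6, 7, 8, 9, 10),
  Type   = c("Midsize", "Large", "Large", "Midsize", "Large"),
  Price  = c(15.7, 20.8, 23.7, 26.3, 34.7),
  gpm100 = c(3.8, 4.2, 4.9, 4.3, 4.9)
)

# Rows by position, columns by name. The original row labels (2 and 5)
# are kept, which is why a subset prints with non-consecutive labels.
sample.subset <- sample.data[c(2, 5), c("X", "Type", "Price")]

# drop = FALSE keeps a data.frame even when only one column is selected.
price.only <- sample.data[c(2, 5), "Price", drop = FALSE]

sample.subset
price.only
```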
  3. Renaming the columns
names(carprice.subset)
## [1] "X"           "Type"        "Min.Price"   "Price"       "Max.Price"  
## [6] "Range.Price" "RoughRange"
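## Now we rename the columns of the new subset data.frame, carprice.subset, so that each name has subset_ as a prefix to the corresponding column name of the original data.frame.
## Note that paste() separates its arguments with a space by default, so the new names read "subset_ X" rather than "subset_X"; paste0() would omit the space.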

names(carprice.subset) <- paste("subset_", names(carprice.subset))

## names of columns after renaming

names(carprice.subset)
## [1] "subset_ X"           "subset_ Type"        "subset_ Min.Price"  
## [4] "subset_ Price"       "subset_ Max.Price"   "subset_ Range.Price"
## [7] "subset_ RoughRange"
carprice.subset
##    subset_ X subset_ Type subset_ Min.Price subset_ Price
## 2          7        Large              19.9          20.8
## 5         10        Large              33.0          34.7
## 7         12      Compact               8.5          13.4
## 10        15      Midsize              13.4          15.9
## 11        16          Van              14.7          16.3
## 12        17          Van              14.7          16.6
## 18        23        Small               7.9           9.2
## 19        24        Small               8.4          11.3
## 20        25      Compact              11.9          13.3
##    subset_ Max.Price subset_ Range.Price subset_ RoughRange
## 2               21.7                 1.8               1.79
## 5               36.3                 3.3               3.30
## 7               18.3                 9.8               9.80
## 10              18.4                 5.0               5.01
## 11              18.0                 3.3               3.31
## 12              18.6                 3.9               3.90
## 18              10.6                 2.7               2.68
## 19              14.2                 5.8               5.80
## 20              14.7                 2.8               2.81
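The space inside names such as "subset_ X" comes from the default separator of `paste()`. A minimal sketch of the difference between `paste()` and `paste0()`; the vector names `old_names`, `with_space` and `without_space` are assumptions made for illustration.

```r
old_names <- c("X", "Type", "Min.Price")

# paste() inserts sep = " " between its arguments by default.
with_space <- paste("subset_", old_names)

# paste0() concatenates with no separator.
without_space <- paste0("subset_", old_names)

print(with_space)
print(without_space)
```

Names that contain a space must be quoted with backticks when used with `$`, for example carprice.subset$`subset_ Price`, so the `paste0()` form is usually easier to work with.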
  4. Summary of the new data.frame, and mean and median of the new data.frame
## [1] "Summary of the subset data.frame : "
##    subset_ X      subset_ Type subset_ Min.Price subset_ Price  
##  Min.   : 7.00   Compact:2     Min.   : 7.90     Min.   : 9.20  
##  1st Qu.:12.00   Large  :2     1st Qu.: 8.50     1st Qu.:13.30  
##  Median :16.00   Midsize:1     Median :13.40     Median :15.90  
##  Mean   :16.56   Small  :2     Mean   :14.71     Mean   :16.83  
##  3rd Qu.:23.00   Sporty :0     3rd Qu.:14.70     3rd Qu.:16.60  
##  Max.   :25.00   Van    :2     Max.   :33.00     Max.   :34.70  
##  subset_ Max.Price subset_ Range.Price subset_ RoughRange
##  Min.   :10.60     Min.   :1.800       Min.   :1.790     
##  1st Qu.:14.70     1st Qu.:2.800       1st Qu.:2.810     
##  Median :18.30     Median :3.300       Median :3.310     
##  Mean   :18.98     Mean   :4.267       Mean   :4.267     
##  3rd Qu.:18.60     3rd Qu.:5.000       3rd Qu.:5.010     
##  Max.   :36.30     Max.   :9.800       Max.   :9.800
## [1] "mean of the price attribute of the subset data.frame is :  16.83"
## [1] "median of the price attribute of the subset data.frame is:  15.9"
## [1] "mean of the Range Price attribute of the subset data.frame is :  4.267"
## [1] "median of the Range Price attribute of the subset data.frame is :  3.3"
## [1] "carprice.data_Price_Mean :  18.57"
## [1] "carprice.subset_Price_Mean :  16.83"
## [1] "carprice.data Price mean is greater than carprice.subset Price mean"
## [1] "carprice.data_Range.Price_Mean :  4.092"
## [1] "carprice.subset_Range.Price_Mean :  4.267"
## [1] "carprice.data Range.Price mean is less than carprice.subset Range.Price mean"
## [1] "carprice.data_Price_Median :  16.3"
## [1] "carprice.subset_Price_Median :  15.9"
## [1] "carprice.data Price median is greater than carprice.subset Price median"
## [1] "carprice.data_Range.Price_Median :  3.3"
## [1] "carprice.subset_Range.Price_Median :  3.3"
## [1] "carprice.data Range.Price median is equal to carprice.subset Range.Price median"
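The comparison messages above follow one pattern: compute the same statistic on both data frames and report which is larger. A minimal sketch of that pattern is given below. The helper `compare_stat()` is hypothetical (it is not taken from the code behind this report), and the 20 Price values are the first 20 rows of the printed data set rather than all 48.

```r
# Hypothetical helper: builds a sentence comparing one statistic
# computed on the full data with the same statistic on the subset.
compare_stat <- function(label, full_value, subset_value) {
  relation <- if (isTRUE(all.equal(full_value, subset_value))) {
    "is equal to"      # all.equal() tolerates floating-point noise
  } else if (full_value > subset_value) {
    "is greater than"
  } else {
    "is less than"
  }
  paste("carprice.data", label, relation, "carprice.subset", label)
}

# Price column for the first 20 rows of the printed data set.
price_full <- c(15.7, 20.8, 23.7, 26.3, 34.7, 40.1, 13.4, 11.4, 15.1, 15.9,
                16.3, 16.6, 18.8, 38.0, 18.4, 15.8, 29.5, 9.2, 11.3, 13.3)

# Same row selection as carprice.subset.
price_subset <- price_full[c(2, 5, 7, 10:12, 18:20)]

mean_message <- compare_stat("Price mean", mean(price_full), mean(price_subset))
print(mean_message)
```

Using `all.equal()` instead of `==` for the equality branch avoids reporting two means as different only because of floating-point rounding.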
  5. This question requires changing the values in a column of the original data.frame based on a selective criterion. Here we will use the column Type and change its values as follows: Midsize -> Mid-size, Compact -> Super-small, Sporty -> Sportz.
## The Type column is a factor: read.csv converted character columns to factors by default in R versions before 4.0.0 (from 4.0.0 onward this requires stringsAsFactors = TRUE). So we are really updating the labels of the 3 levels listed above.
## This can be done in multiple ways. I have given 2 ways below:

## Way-1: use the levels function
levels(carprice.data$Type)[levels(carprice.data$Type) == 'Midsize'] <- "Mid-size"
levels(carprice.data$Type)[levels(carprice.data$Type) == "Compact"] <- "Super-small"
levels(carprice.data$Type)[levels(carprice.data$Type) == "Sporty"] <- "Sportz"

levels(carprice.data$Type)
## [1] "Super-small" "Large"       "Mid-size"    "Small"       "Sportz"     
## [6] "Van"
## Way-2: use the revalue function from the plyr package to rename the levels of a factor. Before this, we reload the data.frame from GitHub so that the original values are restored before we change them again with Way-2.

# stringsAsFactors = TRUE keeps Type a factor on R 4.0.0 and later.
carprice.data <- read.csv(theURL, header = TRUE, stringsAsFactors = TRUE)

levels(carprice.data$Type)
## [1] "Compact" "Large"   "Midsize" "Small"   "Sporty"  "Van"
library(plyr)

carprice.data$Type <- revalue(carprice.data$Type, c("Midsize" = "Mid-size", "Compact" = "Super-small", "Sporty" = "Sportz"))

levels(carprice.data$Type)
## [1] "Super-small" "Large"       "Mid-size"    "Small"       "Sportz"     
## [6] "Van"
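A third option, sketched below, renames all three levels in one step with a named lookup vector and base R only, so it needs no extra package. The names `type_factor` and `lookup` are assumptions made for illustration, and the six values stand in for the Type column.

```r
# One value per car type, standing in for carprice.data$Type.
type_factor <- factor(c("Midsize", "Large", "Compact", "Sporty", "Van", "Small"))

# Old level name on the left, new label on the right.
lookup <- c(Midsize = "Mid-size", Compact = "Super-small", Sporty = "Sportz")

# Replace only the levels that appear in the lookup; the rest stay as-is.
old_levels <- levels(type_factor)
matched <- old_levels %in% names(lookup)
levels(type_factor)[matched] <- unname(lookup[old_levels[matched]])

levels(type_factor)
as.character(type_factor)
```

Because the lookup is indexed by the old level names, the order of entries in `lookup` does not matter.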
  6. Display enough rows to see examples of all of steps 1-5 above
## [1] "First 20 rows of the original data.frame"
##     X        Type Min.Price Price Max.Price Range.Price RoughRange gpm100
## 1   6    Mid-size      14.2  15.7      17.3         3.1       3.09    3.8
## 2   7       Large      19.9  20.8      21.7         1.8       1.79    4.2
## 3   8       Large      22.6  23.7      24.9         2.3       2.31    4.9
## 4   9    Mid-size      26.3  26.3      26.3         0.0      -0.01    4.3
## 5  10       Large      33.0  34.7      36.3         3.3       3.30    4.9
## 6  11    Mid-size      37.5  40.1      42.7         5.2       5.18    4.9
## 7  12 Super-small       8.5  13.4      18.3         9.8       9.80    3.3
## 8  13 Super-small      11.4  11.4      11.4         0.0      -0.01    3.4
## 9  14      Sportz      13.4  15.1      16.8         3.4       3.38    4.2
## 10 15    Mid-size      13.4  15.9      18.4         5.0       5.01    4.0
## 11 16         Van      14.7  16.3      18.0         3.3       3.31    4.9
## 12 17         Van      14.7  16.6      18.6         3.9       3.90    5.7
## 13 18       Large      18.0  18.8      19.6         1.6       1.60    4.7
## 14 19      Sportz      34.6  38.0      41.5         6.9       6.88    4.8
## 15 20       Large      18.4  18.4      18.4         0.0      -0.01    4.2
## 16 21 Super-small      14.5  15.8      17.1         2.6       2.59    3.9
## 17 22       Large      29.5  29.5      29.5         0.0       0.02    4.3
## 18 23       Small       7.9   9.2      10.6         2.7       2.68    3.2
## 19 24       Small       8.4  11.3      14.2         5.8       5.80    3.8
## 20 25 Super-small      11.9  13.3      14.7         2.8       2.81    4.1
##    MPG.city MPG.highway
## 1        22          31
## 2        19          28
## 3        16          25
## 4        19          27
## 5        16          25
## 6        16          25
## 7        25          36
## 8        25          34
## 9        19          28
## 10       21          29
## 11       18          23
## 12       15          20
## 13       17          26
## 14       17          25
## 15       20          28
## 16       23          28
## 17       20          26
## 18       29          33
## 19       23          29
## 20       22          27
## [1] "First 5 rows of the subset data.frame"
##    subset_ X subset_ Type subset_ Min.Price subset_ Price
## 2          7        Large              19.9          20.8
## 5         10        Large              33.0          34.7
## 7         12      Compact               8.5          13.4
## 10        15      Midsize              13.4          15.9
## 11        16          Van              14.7          16.3
##    subset_ Max.Price subset_ Range.Price subset_ RoughRange
## 2               21.7                 1.8               1.79
## 5               36.3                 3.3               3.30
## 7               18.3                 9.8               9.80
## 10              18.4                 5.0               5.01
## 11              18.0                 3.3               3.31
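In the output above the subset still shows Compact and Midsize while the original data.frame shows Super-small and Mid-size. The subset was taken before the levels were renamed, and a subset is an independent copy. A minimal sketch of that behaviour, using `head()` for display; the three-row data frame and the names `original` and `subset.copy` are assumptions made for illustration.

```r
original <- data.frame(
  Type  = factor(c("Midsize", "Large", "Compact")),
  Price = c(15.7, 20.8, 13.4)
)

# Take the subset first, then rename a level in the original.
subset.copy <- original[c(1, 3), ]
levels(original$Type)[levels(original$Type) == "Midsize"] <- "Mid-size"

head(original, 2)      # first 2 rows, with the renamed level
head(subset.copy, 2)   # first 2 rows, still with the old level
```

To carry the new labels into the subset, it would have to be recreated after the rename, or renamed separately.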
  7. Get the data from the GitHub file link
### This has already been handled as part of the initial load of carprice.data, which was read directly from the GitHub file link.