The dataset selected was the “US fatal road accident data for automobiles, 1998 to 2010”.
theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/gamclass/FARS.csv"
accidentData <- read.table(file = theURL, header = TRUE, sep = ",")
numRow <- nrow(accidentData)
numCol <- ncol(accidentData)
head(accidentData)
## X caseid state age airbag injury restraint sex inimpact modelyr
## 1 1998.30 1:1:2 1 20 30 3 1 2 12 1991
## 2 1998.50 1:2:1 1 41 30 2 0 1 2 1987
## 3 1998.70 1:3:1 1 26 30 3 0 1 4 1983
## 4 1998.13 1:8:1 1 17 1 4 0 1 11 1997
## 5 1998.17 1:10:1 1 19 30 3 0 1 12 1998
## 6 1998.22 1:13:1 1 1 30 4 4 1 3 1991
## airbagAvail airbagDeploy Restraint D_injury D_airbagAvail D_airbagDeploy
## 1 no no yes 3 no no
## 2 no no no 2 no no
## 3 no no no 4 no no
## 4 yes yes no 4 yes yes
## 5 no no no 4 yes yes
## 6 no no yes 3 no no
## D_Restraint year
## 1 yes 1998
## 2 no 1998
## 3 no 1998
## 4 no 1998
## 5 no 1998
## 6 yes 1998
summary(accidentData)
## X caseid state age
## Min. :1998 12:243:1 : 9 Min. : 1.00 Min. : 0.00
## 1st Qu.:2001 17:691:1 : 9 1st Qu.:12.00 1st Qu.: 19.00
## Median :2004 53:101:1 : 9 Median :27.00 Median : 27.00
## Mean :2004 6:1512:1 : 9 Mean :27.23 Mean : 52.55
## 3rd Qu.:2007 12:2221:1: 8 3rd Qu.:42.00 3rd Qu.: 49.00
## Max. :2011 12:248:1 : 8 Max. :56.00 Max. :999.00
## (Other) :151106
## airbag injury restraint sex
## Min. : 0.00 Min. :0.000 Min. : 0.000 Min. :1.000
## 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.: 0.000 1st Qu.:1.000
## Median :29.00 Median :3.000 Median : 3.000 Median :2.000
## Mean :22.72 Mean :2.529 Mean : 8.962 Mean :1.553
## 3rd Qu.:30.00 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.:2.000
## Max. :99.00 Max. :8.000 Max. :99.000 Max. :9.000
##
## inimpact modelyr airbagAvail airbagDeploy
## Min. : 0.00 Min. :1924 NA-code: 8485 NA-code:21305
## 1st Qu.: 4.00 1st Qu.:1992 no :55729 no :86228
## Median :11.00 Median :1996 yes :86944 yes :43625
## Mean :10.07 Mean :2006
## 3rd Qu.:12.00 3rd Qu.:2000
## Max. :99.00 Max. :9999
##
## Restraint D_injury D_airbagAvail D_airbagDeploy
## NA-code:10535 Min. :0.00 NA-code: 7232 NA-code:21096
## no :46452 1st Qu.:1.00 no :45087 no :78480
## yes :94171 Median :3.00 yes :98839 yes :51582
## Mean :2.47
## 3rd Qu.:4.00
## Max. :8.00
##
## D_Restraint year
## NA-code:10505 Min. :1998
## no :44629 1st Qu.:2000
## yes :96024 Median :2003
## Mean :2004
## 3rd Qu.:2006
## Max. :2010
##
The mean and median for the ages:
mean(accidentData$age)
## [1] 52.54645
median(accidentData$age)
## [1] 27
The mean and median for the model years:
mean(accidentData$modelyr)
## [1] 2005.587
median(accidentData$modelyr)
## [1] 1996
Here we draw a random sample of about one third of the rows from the whole data set and keep only the age and model year columns.
set.seed(1)
numSelection <- as.integer(1/3*numRow)
randomSelect <- sample(seq_len(numRow), numSelection)
newDataSet <- accidentData[randomSelect, c("age", "modelyr")]
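The sampling step can be sketched on a small made-up data frame. The `toy` data frame and its values are invented for illustration and are not taken from the FARS file; `seq_len()` is used instead of `1:nrow(...)` purely as a style choice.

```r
# Toy data standing in for accidentData (values are invented)
toy <- data.frame(age     = c(20, 41, 26, 17, 19, 1, 33, 52, 47),
                  modelyr = c(1991, 1987, 1983, 1997, 1998, 1991, 2000, 2003, 1995),
                  sex     = c(2, 1, 1, 1, 1, 1, 2, 2, 1))

set.seed(1)                                   # make the draw reproducible
n_pick <- as.integer(nrow(toy) / 3)           # one third of the rows, rounded down
picked <- sample(seq_len(nrow(toy)), n_pick)  # row indices, drawn without replacement
toy_subset <- toy[picked, c("age", "modelyr")]

nrow(toy_subset)   # 3
names(toy_subset)  # "age" "modelyr"
```

Because rows are drawn without replacement, no row can appear twice in the subset, and the subset keeps the original row names until they are reset.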
Store the original row number as a new attribute (column), and renumber the rows.
newDataSet["OrigRowNum"] <- rownames(newDataSet)
rownames(newDataSet) <- NULL
require(plyr)
## Loading required package: plyr
newDataSet <- rename(newDataSet, c("age"="AGE", "modelyr"="MODEL_YEAR", "OrigRowNum"="ORIG_ROW_NUM"))
summary(newDataSet)
## AGE MODEL_YEAR ORIG_ROW_NUM
## Min. : 0.00 Min. :1928 Length:50386
## 1st Qu.: 19.00 1st Qu.:1992 Class :character
## Median : 27.00 Median :1996 Mode :character
## Mean : 53.29 Mean :2008
## 3rd Qu.: 49.00 3rd Qu.:2000
## Max. :999.00 Max. :9999
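The row-number bookkeeping and the renaming above can also be sketched in base R on a toy data frame. This is an illustrative variation, not the method used in this report: it stores the original row number as an integer rather than as character, and it renames columns through a lookup vector instead of plyr's rename(). The `toy` and `new_names` objects are hypothetical names introduced here.

```r
# Toy subset with non-sequential row names, as left behind by row sampling
toy <- data.frame(age = c(44, 28, 12), modelyr = c(1986, 2000, 1998),
                  row.names = c("40134", "56250", "86591"))

toy$OrigRowNum <- as.integer(rownames(toy))  # keep the source row index as a number
rownames(toy) <- NULL                        # reset row names to 1, 2, 3, ...

# Base-R rename via a lookup vector (old name -> new name)
new_names <- c(age = "AGE", modelyr = "MODEL_YEAR", OrigRowNum = "ORIG_ROW_NUM")
names(toy) <- unname(new_names[names(toy)])

str(toy)
```

Keeping the row number as an integer means summary() would report its range instead of only its length and class.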
The mean and median for the ages:
mean(newDataSet$AGE)
## [1] 53.28873
median(newDataSet$AGE)
## [1] 27
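Recall that for the whole data set the mean age was 52.54645 and the median was 27. The random extraction left the median unchanged but shifted the mean to 53.28873. This is consistent with the mean being sensitive to extreme values such as the maximum of 999 (which looks like a code for an unknown age rather than a real age), while the median is not.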
The mean and median for the model years:
mean(newDataSet$MODEL_YEAR)
## [1] 2007.852
median(newDataSet$MODEL_YEAR)
## [1] 1996
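Recall that for the whole data set the mean model year was 2005.587 and the median was 1996. Again, the median is unchanged after the extraction, but the mean shifted to 2007.852. Both means lie above the third quartile of 2000, which suggests they are pulled up by extreme values such as the maximum of 9999 (probably a code for an unknown year).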
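Why the mean moves while the median stays put can be illustrated with a few invented numbers. The sketch below assumes that 999 is a placeholder for an unknown age; that reading is an assumption based on the summary output, not something confirmed by the data set's documentation here.

```r
# Invented ages; 999 stands in for an "unknown" code (an assumption)
ages <- c(20, 41, 26, 17, 19, 999)

mean(ages)    # 187: dominated by the single 999
median(ages)  # 23: barely affected

ages_clean <- replace(ages, ages == 999, NA)  # recode the placeholder as missing
mean(ages_clean, na.rm = TRUE)    # 24.6
median(ages_clean, na.rm = TRUE)  # 20
```

If such codes were recoded to NA before sampling, the means of the full set and of the sample would be expected to agree much more closely.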
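Here we look for every row whose model year is 2000 and replace that value with “Y2K!!”. Because ifelse() returns a character vector in this case, the whole MODEL_YEAR column becomes character.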
modelYear <- newDataSet[, "MODEL_YEAR"]
newDataSet[, "MODEL_YEAR"] <- ifelse(modelYear == 2000, "Y2K!!", modelYear)
head(newDataSet, 100)
## AGE MODEL_YEAR ORIG_ROW_NUM
## 1 44 1986 40134
## 2 28 Y2K!! 56250
## 3 12 1998 86591
## 4 50 2002 137281
## 5 19 1989 30486
## 6 58 2004 135795
## 7 40 1992 142790
## 8 2 1995 99881
## 9 45 1990 95091
## 10 45 1996 9339
## 11 53 1986 31133
## 12 38 1997 26687
## 13 25 2003 103841
## 14 17 1999 58056
## 15 48 Y2K!! 116357
## 16 40 2002 75224
## 17 31 2004 108463
## 18 20 Y2K!! 149918
## 19 61 1994 57439
## 20 8 1998 117503
## 21 33 2001 141270
## 22 10 1990 32063
## 23 22 2005 98492
## 24 31 1989 18976
## 25 47 1990 40387
## 26 20 1992 58355
## 27 53 1994 2024
## 28 15 1993 57791
## 29 16 1995 131437
## 30 16 1997 51437
## 31 17 1990 72856
## 32 17 1992 90611
## 33 16 2002 74587
## 34 14 1999 28143
## 35 20 1992 125036
## 36 32 2003 101021
## 37 17 1992 120028
## 38 64 1983 16313
## 39 25 2004 109368
## 40 22 2001 62152
## 41 18 1996 124060
## 42 52 2002 97782
## 43 17 Y2K!! 118314
## 44 16 1993 83573
## 45 15 2001 80049
## 46 82 2002 119282
## 47 75 1996 3526
## 48 19 1996 72115
## 49 6 1990 110660
## 50 82 2005 104678
## 51 27 1995 72173
## 52 60 1997 130135
## 53 35 1987 66200
## 54 38 1999 36991
## 55 49 1988 10680
## 56 25 1997 15030
## 57 25 1989 47790
## 58 57 2001 78367
## 59 21 1988 100029
## 60 16 1992 61472
## 61 21 1996 137934
## 62 61 1994 44363
## 63 15 1993 69363
## 64 40 1987 50224
## 65 16 1992 98343
## 66 78 1999 38985
## 67 9 1991 72305
## 68 26 1993 115783
## 69 39 1992 12729
## 70 76 1996 132252
## 71 15 1993 51230
## 72 17 2004 126829
## 73 22 2002 52380
## 74 42 1989 50429
## 75 16 1998 71970
## 76 18 2005 134797
## 77 82 1990 130587
## 78 48 1993 58921
## 79 8 2006 117438
## 80 81 2001 145130
## 81 42 1997 65668
## 82 10 2001 107645
## 83 12 1997 60430
## 84 8 1993 49153
## 85 73 2001 114377
## 86 73 1992 30622
## 87 18 1998 107431
## 88 27 1989 18385
## 89 36 1996 37086
## 90 42 1998 21649
## 91 28 1995 36201
## 92 18 1992 8904
## 93 30 1998 97028
## 94 21 1995 132374
## 95 24 2004 117666
## 96 49 1987 120444
## 97 63 2002 68775
## 98 41 1988 61948
## 99 55 2006 122491
## 100 80 2003 91381
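The type change caused by the relabelling can be seen on the first three MODEL_YEAR values of the sample. The flag-column alternative at the end is only a suggestion for cases where the years are still needed as numbers; `toy` and `IS_Y2K` are hypothetical names introduced for this sketch.

```r
model_year <- c(1986, 2000, 1998)   # first three MODEL_YEAR values of the sample

relabelled <- ifelse(model_year == 2000, "Y2K!!", model_year)
class(relabelled)   # "character": the numeric years became strings
relabelled          # "1986" "Y2K!!" "1998"

# Alternative: leave the numeric column alone and add a logical flag
toy <- data.frame(MODEL_YEAR = model_year)
toy$IS_Y2K <- toy$MODEL_YEAR == 2000
mean(toy$MODEL_YEAR)   # numeric summaries still work
```

After the in-place relabelling, calls such as mean() on the column return NA with a warning, since the column is no longer numeric.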
theNewURL <- "https://raw.githubusercontent.com/Tyllis/Rbridge/master/FARS.csv"
gitHubFile <- read.csv(file = theNewURL, header = TRUE, sep = ",")
head(gitHubFile)
## X caseid state age airbag injury restraint sex inimpact modelyr
## 1 1998.30 1:1:2 1 20 30 3 1 2 12 1991
## 2 1998.50 1:2:1 1 41 30 2 0 1 2 1987
## 3 1998.70 1:3:1 1 26 30 3 0 1 4 1983
## 4 1998.13 1:8:1 1 17 1 4 0 1 11 1997
## 5 1998.17 1:10:1 1 19 30 3 0 1 12 1998
## 6 1998.22 1:13:1 1 1 30 4 4 1 3 1991
## airbagAvail airbagDeploy Restraint D_injury D_airbagAvail D_airbagDeploy
## 1 no no yes 3 no no
## 2 no no no 2 no no
## 3 no no no 4 no no
## 4 yes yes no 4 yes yes
## 5 no no no 4 yes yes
## 6 no no yes 3 no no
## D_Restraint year
## 1 yes 1998
## 2 no 1998
## 3 no 1998
## 4 no 1998
## 5 no 1998
## 6 yes 1998