The dataset selected was the “US fatal road accident data for automobiles, 1998 to 2010”.

1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

theURL <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/gamclass/FARS.csv"
accidentData <- read.table(file = theURL, header = TRUE, sep = ",")
numRow <- nrow(accidentData)
numCol <- ncol(accidentData)
head(accidentData)
##         X caseid state age airbag injury restraint sex inimpact modelyr
## 1 1998.30  1:1:2     1  20     30      3         1   2       12    1991
## 2 1998.50  1:2:1     1  41     30      2         0   1        2    1987
## 3 1998.70  1:3:1     1  26     30      3         0   1        4    1983
## 4 1998.13  1:8:1     1  17      1      4         0   1       11    1997
## 5 1998.17 1:10:1     1  19     30      3         0   1       12    1998
## 6 1998.22 1:13:1     1   1     30      4         4   1        3    1991
##   airbagAvail airbagDeploy Restraint D_injury D_airbagAvail D_airbagDeploy
## 1          no           no       yes        3            no             no
## 2          no           no        no        2            no             no
## 3          no           no        no        4            no             no
## 4         yes          yes        no        4           yes            yes
## 5          no           no        no        4           yes            yes
## 6          no           no       yes        3            no             no
##   D_Restraint year
## 1         yes 1998
## 2          no 1998
## 3          no 1998
## 4          no 1998
## 5          no 1998
## 6         yes 1998
summary(accidentData)
##        X              caseid           state            age        
##  Min.   :1998   12:243:1 :     9   Min.   : 1.00   Min.   :  0.00  
##  1st Qu.:2001   17:691:1 :     9   1st Qu.:12.00   1st Qu.: 19.00  
##  Median :2004   53:101:1 :     9   Median :27.00   Median : 27.00  
##  Mean   :2004   6:1512:1 :     9   Mean   :27.23   Mean   : 52.55  
##  3rd Qu.:2007   12:2221:1:     8   3rd Qu.:42.00   3rd Qu.: 49.00  
##  Max.   :2011   12:248:1 :     8   Max.   :56.00   Max.   :999.00  
##                 (Other)  :151106                                   
##      airbag          injury        restraint           sex       
##  Min.   : 0.00   Min.   :0.000   Min.   : 0.000   Min.   :1.000  
##  1st Qu.: 1.00   1st Qu.:1.000   1st Qu.: 0.000   1st Qu.:1.000  
##  Median :29.00   Median :3.000   Median : 3.000   Median :2.000  
##  Mean   :22.72   Mean   :2.529   Mean   : 8.962   Mean   :1.553  
##  3rd Qu.:30.00   3rd Qu.:4.000   3rd Qu.: 3.000   3rd Qu.:2.000  
##  Max.   :99.00   Max.   :8.000   Max.   :99.000   Max.   :9.000  
##                                                                  
##     inimpact        modelyr      airbagAvail     airbagDeploy  
##  Min.   : 0.00   Min.   :1924   NA-code: 8485   NA-code:21305  
##  1st Qu.: 4.00   1st Qu.:1992   no     :55729   no     :86228  
##  Median :11.00   Median :1996   yes    :86944   yes    :43625  
##  Mean   :10.07   Mean   :2006                                  
##  3rd Qu.:12.00   3rd Qu.:2000                                  
##  Max.   :99.00   Max.   :9999                                  
##                                                                
##    Restraint        D_injury    D_airbagAvail   D_airbagDeploy 
##  NA-code:10535   Min.   :0.00   NA-code: 7232   NA-code:21096  
##  no     :46452   1st Qu.:1.00   no     :45087   no     :78480  
##  yes    :94171   Median :3.00   yes    :98839   yes    :51582  
##                  Mean   :2.47                                  
##                  3rd Qu.:4.00                                  
##                  Max.   :8.00                                  
##                                                                
##   D_Restraint         year     
##  NA-code:10505   Min.   :1998  
##  no     :44629   1st Qu.:2000  
##  yes    :96024   Median :2003  
##                  Mean   :2004  
##                  3rd Qu.:2006  
##                  Max.   :2010  
## 
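
The per-level counts reported for columns such as caseid and airbagAvail appear because those columns were read as factors. From R 4.0.0 onward, read.table and read.csv default to stringsAsFactors = FALSE, so the same call would summarise them as character columns instead. A minimal sketch on a made-up three-row CSV (the inline text and the names csvText, toyChar and toyFact are illustrative, not part of the data set):

```r
# Hypothetical inline CSV standing in for a few columns of the real file.
csvText <- "caseid,age,airbagAvail
1:1:2,20,no
1:2:1,41,no
1:8:1,17,yes"

# Strings stay character (the default in R >= 4.0.0).
toyChar <- read.csv(text = csvText, stringsAsFactors = FALSE)
# Explicitly request factors so summary() reports per-level counts.
toyFact <- read.csv(text = csvText, stringsAsFactors = TRUE)

class(toyChar$airbagAvail)    # "character"
class(toyFact$airbagAvail)    # "factor"
summary(toyFact$airbagAvail)  # counts per level: no = 2, yes = 1
```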

The mean and median for the ages:

mean(accidentData$age)
## [1] 52.54645
median(accidentData$age)
## [1] 27

The mean and median for the model years:

mean(accidentData$modelyr)
## [1] 2005.587
median(accidentData$modelyr)
## [1] 1996
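
The mean age (52.5) sits well above the median (27), and the mean model year (2006) exceeds the third quartile (2000). The maxima in the summary are 999 for age and 9999 for modelyr, which look like placeholder codes for unknown values; the source does not confirm this, so treat it as an assumption. Under that assumption, a sketch on made-up data of recoding such codes to NA before averaging (toyData and its values are hypothetical):

```r
# Made-up values; 999 and 9999 are ASSUMED to mean "unknown".
toyData <- data.frame(
  age     = c(20, 41, 26, 17, 999),
  modelyr = c(1991, 1987, 1983, 9999, 1998)
)

# Replace the assumed placeholder codes with NA.
toyData$age[toyData$age == 999] <- NA
toyData$modelyr[toyData$modelyr == 9999] <- NA

# na.rm = TRUE leaves the NA entries out of the calculation.
mean(toyData$age, na.rm = TRUE)      # 26
median(toyData$age, na.rm = TRUE)    # 23
mean(toyData$modelyr, na.rm = TRUE)  # 1989.75
```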

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

Here we randomly extract about one third of the rows, without replacement, and keep only the age and model year columns. The seed is fixed so that the selection is reproducible.

set.seed(1)
numSelection <- as.integer(1/3*numRow)
randomSelect <- sample(seq_len(nrow(accidentData)), numSelection)
newDataSet <- accidentData[randomSelect, c("age", "modelyr")]

Store the original row number as a new attribute (column), and renumber the rows.

newDataSet["OrigRowNum"] <- rownames(newDataSet)
rownames(newDataSet) <- NULL
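
The sampling and row-number bookkeeping above can be sketched on a small made-up data frame. This variant stores the original row positions as integers rather than as character row names, which keeps them numerically sortable; toyFrame, toySubset and picked are hypothetical names.

```r
toyFrame <- data.frame(age     = c(20, 41, 26, 17, 19, 1),
                       modelyr = c(1991, 1987, 1983, 1997, 1998, 1991))

set.seed(1)                                  # make the draw repeatable
nPick  <- as.integer(nrow(toyFrame) / 3)     # one third of the rows
picked <- sample(seq_len(nrow(toyFrame)), nPick)

toySubset <- toyFrame[picked, c("age", "modelyr")]
toySubset$OrigRowNum <- picked               # integer positions, not strings
rownames(toySubset) <- NULL                  # renumber the rows from 1

toySubset
```

Each value in OrigRowNum can then be used directly as an index back into the full data frame.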

3. Create new column names for the new data frame.

require(plyr)
## Loading required package: plyr
newDataSet <- rename(newDataSet, c("age"="AGE", "modelyr"="MODEL_YEAR", "OrigRowNum"="ORIG_ROW_NUM"))
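
If loading plyr is not desirable, the same renaming can be done in base R by assigning to names(). A sketch on a hypothetical three-column frame (toyFrame and newNames are illustrative names):

```r
toyFrame <- data.frame(age = c(20, 41), modelyr = c(1991, 1987),
                       OrigRowNum = c("5", "9"))

# Map old names to new ones; columns not listed keep their names.
newNames <- c(age = "AGE", modelyr = "MODEL_YEAR", OrigRowNum = "ORIG_ROW_NUM")
hit <- names(toyFrame) %in% names(newNames)
names(toyFrame)[hit] <- unname(newNames[names(toyFrame)[hit]])

names(toyFrame)  # "AGE" "MODEL_YEAR" "ORIG_ROW_NUM"
```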

4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(newDataSet)
##       AGE           MODEL_YEAR   ORIG_ROW_NUM      
##  Min.   :  0.00   Min.   :1928   Length:50386      
##  1st Qu.: 19.00   1st Qu.:1992   Class :character  
##  Median : 27.00   Median :1996   Mode  :character  
##  Mean   : 53.29   Mean   :2008                     
##  3rd Qu.: 49.00   3rd Qu.:2000                     
##  Max.   :999.00   Max.   :9999

The mean and median for the ages:

mean(newDataSet$AGE)
## [1] 53.28873
median(newDataSet$AGE)
## [1] 27

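Recall that the mean age for the whole data set was 52.54645 and the median was 27. The random extraction left the median unchanged but shifted the mean slightly, to 53.28873. The mean is more sensitive than the median to extreme values such as the maximum of 999, which likely explains the difference.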

The mean and median for the model years:

mean(newDataSet$MODEL_YEAR)
## [1] 2007.852
median(newDataSet$MODEL_YEAR)
## [1] 1996

Recall that the mean of model years for the whole data set was 2005.587, and the median was 1996. Again, the median remains unchanged after the extraction, but the mean shifted.
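
The comparison asked for in this step can also be laid out as a small table instead of being read off separate outputs. A sketch on made-up ages (toyAges, toyThird and comparison are hypothetical names, and the 999 entries are illustrative extreme values, not the real data):

```r
# Made-up ages: 700 ordinary values plus three extreme entries of 999.
toyAges <- c(rep(c(17, 19, 27, 27, 27, 41, 60), times = 100), rep(999, 3))

set.seed(1)
toyThird <- sample(toyAges, length(toyAges) %/% 3)  # random third of the values

# Side-by-side table of both statistics for the full set and the subset.
comparison <- data.frame(
  statistic = c("mean", "median"),
  full      = c(mean(toyAges), median(toyAges)),
  subset    = c(mean(toyThird), median(toyThird))
)
comparison$difference <- comparison$subset - comparison$full
comparison
```

The difference column shows at a glance which statistic moved and by how much.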

5. For at least 3 values in a column please rename so that every value in that column is renamed.

Here we look for every vehicle with a model year of 2000 and rename that value to “Y2K!!”. Because the replacement is a string, the whole MODEL_YEAR column is coerced to character.

modelYear <- newDataSet[, "MODEL_YEAR"]
newDataSet[, "MODEL_YEAR"] <- ifelse(modelYear == 2000, "Y2K!!", modelYear)
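
The task statement asks for at least three values. One way to relabel several values at once is a named lookup vector. A sketch on made-up years, where every label other than "Y2K!!" is invented for illustration and toyYears, labels and recoded are hypothetical names:

```r
toyYears <- c(1998, 2000, 1999, 2001, 2000, 1995)

# Hypothetical labels, keyed by the year each one replaces.
labels <- c("1999" = "Y2K-1", "2000" = "Y2K!!", "2001" = "Y2K+1")

key <- as.character(toyYears)
# Use the label where one exists, otherwise keep the year as text.
recoded <- unname(ifelse(key %in% names(labels), labels[key], key))

class(recoded)  # "character"
recoded         # "1998" "Y2K!!" "Y2K-1" "Y2K+1" "Y2K!!" "1995"
```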

6. Display enough rows to see examples of all of steps 1-5 above.

head(newDataSet, 100)
##     AGE MODEL_YEAR ORIG_ROW_NUM
## 1    44       1986        40134
## 2    28      Y2K!!        56250
## 3    12       1998        86591
## 4    50       2002       137281
## 5    19       1989        30486
## 6    58       2004       135795
## 7    40       1992       142790
## 8     2       1995        99881
## 9    45       1990        95091
## 10   45       1996         9339
## 11   53       1986        31133
## 12   38       1997        26687
## 13   25       2003       103841
## 14   17       1999        58056
## 15   48      Y2K!!       116357
## 16   40       2002        75224
## 17   31       2004       108463
## 18   20      Y2K!!       149918
## 19   61       1994        57439
## 20    8       1998       117503
## 21   33       2001       141270
## 22   10       1990        32063
## 23   22       2005        98492
## 24   31       1989        18976
## 25   47       1990        40387
## 26   20       1992        58355
## 27   53       1994         2024
## 28   15       1993        57791
## 29   16       1995       131437
## 30   16       1997        51437
## 31   17       1990        72856
## 32   17       1992        90611
## 33   16       2002        74587
## 34   14       1999        28143
## 35   20       1992       125036
## 36   32       2003       101021
## 37   17       1992       120028
## 38   64       1983        16313
## 39   25       2004       109368
## 40   22       2001        62152
## 41   18       1996       124060
## 42   52       2002        97782
## 43   17      Y2K!!       118314
## 44   16       1993        83573
## 45   15       2001        80049
## 46   82       2002       119282
## 47   75       1996         3526
## 48   19       1996        72115
## 49    6       1990       110660
## 50   82       2005       104678
## 51   27       1995        72173
## 52   60       1997       130135
## 53   35       1987        66200
## 54   38       1999        36991
## 55   49       1988        10680
## 56   25       1997        15030
## 57   25       1989        47790
## 58   57       2001        78367
## 59   21       1988       100029
## 60   16       1992        61472
## 61   21       1996       137934
## 62   61       1994        44363
## 63   15       1993        69363
## 64   40       1987        50224
## 65   16       1992        98343
## 66   78       1999        38985
## 67    9       1991        72305
## 68   26       1993       115783
## 69   39       1992        12729
## 70   76       1996       132252
## 71   15       1993        51230
## 72   17       2004       126829
## 73   22       2002        52380
## 74   42       1989        50429
## 75   16       1998        71970
## 76   18       2005       134797
## 77   82       1990       130587
## 78   48       1993        58921
## 79    8       2006       117438
## 80   81       2001       145130
## 81   42       1997        65668
## 82   10       2001       107645
## 83   12       1997        60430
## 84    8       1993        49153
## 85   73       2001       114377
## 86   73       1992        30622
## 87   18       1998       107431
## 88   27       1989        18385
## 89   36       1996        37086
## 90   42       1998        21649
## 91   28       1995        36201
## 92   18       1992         8904
## 93   30       1998        97028
## 94   21       1995       132374
## 95   24       2004       117666
## 96   49       1987       120444
## 97   63       2002        68775
## 98   41       1988        61948
## 99   55       2006       122491
## 100  80       2003        91381

7. Place the original .csv in a GitHub file and have R read from the link.

theNewURL <- "https://raw.githubusercontent.com/Tyllis/Rbridge/master/FARS.csv"
gitHubFile <- read.csv(file = theNewURL, header = TRUE, sep = ",")
head(gitHubFile)
##         X caseid state age airbag injury restraint sex inimpact modelyr
## 1 1998.30  1:1:2     1  20     30      3         1   2       12    1991
## 2 1998.50  1:2:1     1  41     30      2         0   1        2    1987
## 3 1998.70  1:3:1     1  26     30      3         0   1        4    1983
## 4 1998.13  1:8:1     1  17      1      4         0   1       11    1997
## 5 1998.17 1:10:1     1  19     30      3         0   1       12    1998
## 6 1998.22 1:13:1     1   1     30      4         4   1        3    1991
##   airbagAvail airbagDeploy Restraint D_injury D_airbagAvail D_airbagDeploy
## 1          no           no       yes        3            no             no
## 2          no           no        no        2            no             no
## 3          no           no        no        4            no             no
## 4         yes          yes        no        4           yes            yes
## 5          no           no        no        4           yes            yes
## 6          no           no       yes        3            no             no
##   D_Restraint year
## 1         yes 1998
## 2          no 1998
## 3          no 1998
## 4          no 1998
## 5          no 1998
## 6         yes 1998
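
To check that the GitHub copy matches the original beyond the first six rows, the two data frames can be compared directly. A sketch using two inline CSV strings in place of the two URLs, so that it runs offline (csvA, csvB, copyA and copyB are stand-in names):

```r
csvA <- "X,caseid,age\n1998.30,1:1:2,20\n1998.50,1:2:1,41"
csvB <- csvA  # stands in for the copy hosted elsewhere

copyA <- read.csv(text = csvA, header = TRUE)
copyB <- read.csv(text = csvB, header = TRUE)

identical(copyA, copyB)          # TRUE only if every column and value matches
isTRUE(all.equal(copyA, copyB))  # tolerates tiny floating-point differences
```

With the objects from this report, the analogous call would be identical(accidentData, gitHubFile).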