Introduction: Pittsburgh Bridges Data Set

This report analyzes the Pittsburgh Bridges Data Set and provides the data transformation logic to convert it to a format usable for data analysis. The data set is provided as a flat file with an accompanying data dictionary on the University of California Irvine Machine Learning Repository.

The required tasks are to:
1. Study the data set and its description in the data dictionary.
2. Provide relevant column names.
3. Create an R data frame with a subset of its columns.
4. Deliver this R Markdown file to perform these transformation tasks.

The rest of this report follows the order of the first three tasks above.

Study the Data Set and Its Description

We first load the raw data set (published online; a local copy of version 2 is read here) and use str() and summary() to assess the data frame’s gross characteristics. There are two versions of the data file: V1 contains the original examples and V2 contains discretized numeric properties.

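# Remote copies of both versions on the UCI repository (kept for reference)
#URLV1="https://archive.ics.uci.edu/ml/machine-learning-databases/bridges/bridges.data.version1"
#URLV2="https://archive.ics.uci.edu/ml/machine-learning-databases/bridges/bridges.data.version2"
# Local copy of version 2, which is the file analyzed in this report
URLV2 <- "./bridges.data.version2.txt"
#bridgesV1Raw = read.csv(URLV1, header=FALSE)
#str(bridgesV1Raw)
#summary(bridgesV1Raw)

# The file has no header row, so columns are named V1..V13 by default.
# stringsAsFactors=TRUE is needed on R >= 4.0 to get the factor columns shown below.
bridgesV2Raw <- read.csv(URLV2, header=FALSE, stringsAsFactors=TRUE)
str(bridgesV2Raw)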
## 'data.frame':    108 obs. of  13 variables:
##  $ V1 : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
##  $ V2 : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
##  $ V3 : Factor w/ 55 levels "?","1","10","11",..: 24 19 35 23 17 21 22 24 35 23 ...
##  $ V4 : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ V5 : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
##  $ V6 : Factor w/ 4 levels "?","LONG","MEDIUM",..: 1 3 1 3 1 4 3 3 1 3 ...
##  $ V7 : Factor w/ 5 levels "?","1","2","4",..: 3 3 2 3 3 3 2 3 2 3 ...
##  $ V8 : Factor w/ 3 levels "?","G","N": 3 3 3 3 3 3 3 3 3 3 ...
##  $ V9 : Factor w/ 3 levels "?","DECK","THROUGH": 3 3 3 3 3 3 3 3 2 3 ...
##  $ V10: Factor w/ 4 levels "?","IRON","STEEL",..: 4 4 4 4 4 4 2 2 4 4 ...
##  $ V11: Factor w/ 4 levels "?","LONG","MEDIUM",..: 4 4 1 4 1 3 4 4 1 3 ...
##  $ V12: Factor w/ 4 levels "?","F","S","S-F": 3 3 3 3 3 3 3 3 3 3 ...
##  $ V13: Factor w/ 8 levels "?","ARCH","CANTILEV",..: 8 8 8 8 8 8 7 7 8 8 ...
summary(bridgesV2Raw)
##        V1      V2           V3            V4            V5          V6    
##  E1     :  1   A:49   28     : 5   CRAFTS  :18   AQUEDUCT: 4   ?     :27  
##  E10    :  1   M:41   39     : 5   EMERGING:15   HIGHWAY :71   LONG  :21  
##  E100   :  1   O:15   25     : 4   MATURE  :54   RR      :32   MEDIUM:48  
##  E101   :  1   Y: 3   27     : 4   MODERN  :21   WALK    : 1   SHORT :12  
##  E102   :  1          29     : 4                                          
##  E103   :  1          1      : 3                                          
##  (Other):102          (Other):83                                          
##  V7     V8           V9        V10         V11      V12           V13    
##  ?:16   ?: 2   ?      : 6   ?    : 2   ?     :16   ?  : 5   SIMPLE-T:44  
##  1: 4   G:80   DECK   :15   IRON :11   LONG  :30   F  :58   WOOD    :16  
##  2:61   N:26   THROUGH:87   STEEL:79   MEDIUM:53   S  :30   ARCH    :13  
##  4:23                       WOOD :16   SHORT : 9   S-F:15   CANTILEV:11  
##  6: 4                                                       SUSPEN  :11  
##                                                             CONT-T  :10  
##                                                             (Other) : 3

We observe that the data dictionary does not accurately describe the full range of values for every attribute in the version 2 data file. Column V2 (associated with “RIVER” in the data dictionary) contains four values: A, M, O, and Y. Pittsburgh is bordered by three rivers, the Allegheny, Monongahela, and Ohio, which clearly explains A, M, and O. However, three bridges have a river value of ‘Y’, for which no clear explanation can be found. Several other columns use ‘?’ as a placeholder for an unknown value.
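
The checks behind these observations can be sketched as follows. This is a minimal illustration on a made-up five-row sample in the same comma-separated layout; the rows and the names sampleText, bridgesSample, riverCounts, and unknownCounts are assumptions for illustration and are not part of the data set. It tabulates the second column and counts the ‘?’ placeholders in each column.

```r
# Hypothetical sample: identifier, river code, location, era (not real rows)
sampleText <- "S1,M,3,CRAFTS
S2,A,25,CRAFTS
S3,A,39,CRAFTS
S4,Y,?,MATURE
S5,O,?,MODERN"

bridgesSample <- read.csv(text = sampleText, header = FALSE,
                          stringsAsFactors = TRUE)

# Frequency of each code in the second column (RIVER in the data dictionary)
riverCounts <- table(bridgesSample$V2)
print(riverCounts)

# Number of "?" placeholders in each column
unknownCounts <- sapply(bridgesSample, function(col) sum(col == "?"))
print(unknownCounts)

# Alternative: let read.csv convert "?" to NA while reading
bridgesSampleNA <- read.csv(text = sampleText, header = FALSE,
                            na.strings = "?")
print(colSums(is.na(bridgesSampleNA)))
```

Reading with na.strings = "?" is a design choice rather than something this report does: it turns the placeholder into a true missing value, so a column such as the third one in the sample is read as numeric instead of as a factor with a "?" level.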

Renaming Data Columns

We use the attribute names from the data dictionary as the new column names. The data dictionary gives a good description of each attribute, and keeping the data frame column names consistent with it will allow easy mapping in the future.
We also verify that the number of attributes in the data dictionary is consistent with the data file dimensions: both have 13.

rawColumns=c("IDENTIF", "RIVER", "LOCATION", "ERECTED", "PURPOSE", "LENGTH", 
             "LANES", "CLEAR-G", "T-OR-D", "MATERIAL", "SPAN", "REL-L","TYPE")

colnames(bridgesV2Raw) <- rawColumns

head(bridgesV2Raw)
##   IDENTIF RIVER LOCATION ERECTED  PURPOSE LENGTH LANES CLEAR-G  T-OR-D
## 1      E1     M        3  CRAFTS  HIGHWAY      ?     2       N THROUGH
## 2      E2     A       25  CRAFTS  HIGHWAY MEDIUM     2       N THROUGH
## 3      E3     A       39  CRAFTS AQUEDUCT      ?     1       N THROUGH
## 4      E5     A       29  CRAFTS  HIGHWAY MEDIUM     2       N THROUGH
## 5      E6     M       23  CRAFTS  HIGHWAY      ?     2       N THROUGH
## 6      E7     A       27  CRAFTS  HIGHWAY  SHORT     2       N THROUGH
##   MATERIAL   SPAN REL-L TYPE
## 1     WOOD  SHORT     S WOOD
## 2     WOOD  SHORT     S WOOD
## 3     WOOD      ?     S WOOD
## 4     WOOD  SHORT     S WOOD
## 5     WOOD      ?     S WOOD
## 6     WOOD MEDIUM     S WOOD
str(bridgesV2Raw)
## 'data.frame':    108 obs. of  13 variables:
##  $ IDENTIF : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
##  $ RIVER   : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
##  $ LOCATION: Factor w/ 55 levels "?","1","10","11",..: 24 19 35 23 17 21 22 24 35 23 ...
##  $ ERECTED : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PURPOSE : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
##  $ LENGTH  : Factor w/ 4 levels "?","LONG","MEDIUM",..: 1 3 1 3 1 4 3 3 1 3 ...
##  $ LANES   : Factor w/ 5 levels "?","1","2","4",..: 3 3 2 3 3 3 2 3 2 3 ...
##  $ CLEAR-G : Factor w/ 3 levels "?","G","N": 3 3 3 3 3 3 3 3 3 3 ...
##  $ T-OR-D  : Factor w/ 3 levels "?","DECK","THROUGH": 3 3 3 3 3 3 3 3 2 3 ...
##  $ MATERIAL: Factor w/ 4 levels "?","IRON","STEEL",..: 4 4 4 4 4 4 2 2 4 4 ...
##  $ SPAN    : Factor w/ 4 levels "?","LONG","MEDIUM",..: 4 4 1 4 1 3 4 4 1 3 ...
##  $ REL-L   : Factor w/ 4 levels "?","F","S","S-F": 3 3 3 3 3 3 3 3 3 3 ...
##  $ TYPE    : Factor w/ 8 levels "?","ARCH","CANTILEV",..: 8 8 8 8 8 8 7 7 8 8 ...
summary(bridgesV2Raw)
##     IDENTIF    RIVER     LOCATION      ERECTED       PURPOSE      LENGTH  
##  E1     :  1   A:49   28     : 5   CRAFTS  :18   AQUEDUCT: 4   ?     :27  
##  E10    :  1   M:41   39     : 5   EMERGING:15   HIGHWAY :71   LONG  :21  
##  E100   :  1   O:15   25     : 4   MATURE  :54   RR      :32   MEDIUM:48  
##  E101   :  1   Y: 3   27     : 4   MODERN  :21   WALK    : 1   SHORT :12  
##  E102   :  1          29     : 4                                          
##  E103   :  1          1      : 3                                          
##  (Other):102          (Other):83                                          
##  LANES  CLEAR-G     T-OR-D    MATERIAL      SPAN    REL-L          TYPE   
##  ?:16   ?: 2    ?      : 6   ?    : 2   ?     :16   ?  : 5   SIMPLE-T:44  
##  1: 4   G:80    DECK   :15   IRON :11   LONG  :30   F  :58   WOOD    :16  
##  2:61   N:26    THROUGH:87   STEEL:79   MEDIUM:53   S  :30   ARCH    :13  
##  4:23                        WOOD :16   SHORT : 9   S-F:15   CANTILEV:11  
##  6: 4                                                        SUSPEN  :11  
##                                                              CONT-T  :10  
##                                                              (Other) : 3
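
The consistency check mentioned in this section does not appear in the code shown. A minimal sketch of one way to do it is given here, using a made-up two-row, three-column sample (the rows and the names bridgesSample, sampleColumns, clearance, and deckType are assumptions for illustration). It also shows how to access columns such as CLEAR-G and T-OR-D, whose hyphens make them non-syntactic names in R.

```r
# Hypothetical sample with three columns; the real file has 13
bridgesSample <- read.csv(text = "S1,N,THROUGH\nS2,G,DECK", header = FALSE,
                          stringsAsFactors = TRUE)
sampleColumns <- c("IDENTIF", "CLEAR-G", "T-OR-D")

# Stop early if the number of names does not match the number of columns
stopifnot(length(sampleColumns) == ncol(bridgesSample))
colnames(bridgesSample) <- sampleColumns

# Hyphenated names must be quoted with backticks after $, or passed to [[ ]]
clearance <- bridgesSample$`CLEAR-G`
deckType <- bridgesSample[["T-OR-D"]]
print(levels(clearance))
print(levels(deckType))
```

Keeping the data dictionary spelling preserves the easy mapping described above, at the cost of needing backticks or [[ ]] for the three hyphenated names.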

Subsetting Data

LENGTH is the column with the most unknown values: 27 of the 108 rows are coded ‘?’. In addition, 3 rows carry the unexplained RIVER value Y. We therefore drop the rows with RIVER=Y and omit the LENGTH column from the data set to be delivered for analysis. We expect the resulting data set to have 105 observations (108 - 3) of 12 variables (13 - 1).
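
The chunk that produced the output below is not printed in this report, so the following is only a sketch of how such a subset could be taken, not the code actually used. It runs on a made-up four-row sample; the rows and the names bridgesSample, keepRows, keepCols, and bridgesSubset are assumptions for illustration.

```r
# Hypothetical sample with a few of the renamed columns (not real rows)
bridgesSample <- data.frame(
  IDENTIF = c("S1", "S2", "S3", "S4"),
  RIVER   = c("M", "A", "Y", "O"),
  LENGTH  = c("?", "MEDIUM", "?", "LONG"),
  LANES   = c("2", "2", "?", "4"),
  stringsAsFactors = TRUE
)

# Keep rows whose RIVER is not "Y" and every column except LENGTH
keepRows <- bridgesSample$RIVER != "Y"
keepCols <- setdiff(colnames(bridgesSample), "LENGTH")
bridgesSubset <- bridgesSample[keepRows, keepCols]

str(bridgesSubset)

# Subsetting keeps unused factor levels (here "Y"); droplevels() removes them
print(levels(bridgesSubset$RIVER))
print(levels(droplevels(bridgesSubset)$RIVER))
```

Row subsetting does not drop unused factor levels, which is consistent with the output below still listing 108 levels for IDENTIF and 4 levels for RIVER after the three rows are removed. Calling droplevels() on the result would be one way to discard them if that matters for later analysis.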

## 'data.frame':    105 obs. of  12 variables:
##  $ IDENTIF : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
##  $ RIVER   : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
##  $ LOCATION: Factor w/ 55 levels "?","1","10","11",..: 24 19 35 23 17 21 22 24 35 23 ...
##  $ ERECTED : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PURPOSE : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
##  $ LANES   : Factor w/ 5 levels "?","1","2","4",..: 3 3 2 3 3 3 2 3 2 3 ...
##  $ CLEAR-G : Factor w/ 3 levels "?","G","N": 3 3 3 3 3 3 3 3 3 3 ...
##  $ T-OR-D  : Factor w/ 3 levels "?","DECK","THROUGH": 3 3 3 3 3 3 3 3 2 3 ...
##  $ MATERIAL: Factor w/ 4 levels "?","IRON","STEEL",..: 4 4 4 4 4 4 2 2 4 4 ...
##  $ SPAN    : Factor w/ 4 levels "?","LONG","MEDIUM",..: 4 4 1 4 1 3 4 4 1 3 ...
##  $ REL-L   : Factor w/ 4 levels "?","F","S","S-F": 3 3 3 3 3 3 3 3 3 3 ...
##  $ TYPE    : Factor w/ 8 levels "?","ARCH","CANTILEV",..: 8 8 8 8 8 8 7 7 8 8 ...

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the subset shown above.