This report analyzes the Pittsburgh Bridges Data Set and provides the data transformation logic to convert it to a usable format for data analysis. The data set is provided as a flat file with an accompanying data dictionary on the University of California, Irvine Machine Learning Repository.
The required tasks are to: 1. Study the data set and its description in the data dictionary. 2. Provide relevant column names. 3. Create an R data frame with a subset of its columns. 4. Deliver this R Markdown file to perform these transformation tasks.
The rest of this report follows the order of the first three tasks above.
We first access the raw data set (here from a local copy of the version 2 file; the online URLs are kept as comments) and use str() and summary() to assess the data frame’s gross characteristics. There are two versions of the data file: V1 contains the original examples and V2 contains discretized numeric properties.
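# Remote sources on the UCI repository (kept for reference; a local copy is read below)
#URLV1 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/bridges/bridges.data.version1"
#URLV2 <- "https://archive.ics.uci.edu/ml/machine-learning-databases/bridges/bridges.data.version2"
URLV2 <- "./bridges.data.version2.txt"
#bridgesV1Raw <- read.csv(URLV1, header = FALSE)
#str(bridgesV1Raw)
#summary(bridgesV1Raw)
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0
bridgesV2Raw <- read.csv(URLV2, header = FALSE, stringsAsFactors = TRUE)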
str(bridgesV2Raw)
## 'data.frame': 108 obs. of 13 variables:
## $ V1 : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
## $ V2 : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
## $ V3 : Factor w/ 55 levels "?","1","10","11",..: 24 19 35 23 17 21 22 24 35 23 ...
## $ V4 : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ V5 : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
## $ V6 : Factor w/ 4 levels "?","LONG","MEDIUM",..: 1 3 1 3 1 4 3 3 1 3 ...
## $ V7 : Factor w/ 5 levels "?","1","2","4",..: 3 3 2 3 3 3 2 3 2 3 ...
## $ V8 : Factor w/ 3 levels "?","G","N": 3 3 3 3 3 3 3 3 3 3 ...
## $ V9 : Factor w/ 3 levels "?","DECK","THROUGH": 3 3 3 3 3 3 3 3 2 3 ...
## $ V10: Factor w/ 4 levels "?","IRON","STEEL",..: 4 4 4 4 4 4 2 2 4 4 ...
## $ V11: Factor w/ 4 levels "?","LONG","MEDIUM",..: 4 4 1 4 1 3 4 4 1 3 ...
## $ V12: Factor w/ 4 levels "?","F","S","S-F": 3 3 3 3 3 3 3 3 3 3 ...
## $ V13: Factor w/ 8 levels "?","ARCH","CANTILEV",..: 8 8 8 8 8 8 7 7 8 8 ...
summary(bridgesV2Raw)
## V1 V2 V3 V4 V5 V6
## E1 : 1 A:49 28 : 5 CRAFTS :18 AQUEDUCT: 4 ? :27
## E10 : 1 M:41 39 : 5 EMERGING:15 HIGHWAY :71 LONG :21
## E100 : 1 O:15 25 : 4 MATURE :54 RR :32 MEDIUM:48
## E101 : 1 Y: 3 27 : 4 MODERN :21 WALK : 1 SHORT :12
## E102 : 1 29 : 4
## E103 : 1 1 : 3
## (Other):102 (Other):83
## V7 V8 V9 V10 V11 V12 V13
## ?:16 ?: 2 ? : 6 ? : 2 ? :16 ? : 5 SIMPLE-T:44
## 1: 4 G:80 DECK :15 IRON :11 LONG :30 F :58 WOOD :16
## 2:61 N:26 THROUGH:87 STEEL:79 MEDIUM:53 S :30 ARCH :13
## 4:23 WOOD :16 SHORT : 9 S-F:15 CANTILEV:11
## 6: 4 SUSPEN :11
## CONT-T :10
## (Other) : 3
We observe that the data dictionary does not accurately describe the full range of values for every attribute in the version 2 data file. Column V2 (associated with “RIVER” in the data dictionary) contains four possible values: A, M, O, and Y. Pittsburgh sits on three rivers, the Allegheny, Monongahela, and Ohio, which clearly explains A, M, and O. However, three bridges have a river value of ‘Y’, for which no clear logic can be ascribed.
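A comparison like this can be made explicit by tabulating the observed codes against the ones that can be explained. The following is a minimal sketch, not the code used for this report: `bridgesSample` and `documentedRivers` are hypothetical names, the first three records copy rows printed in this report, and the last record is made up to carry the unexplained code.

```r
# Illustrative sample only: rows E1-E3 copy records printed in this report;
# row X1 is a made-up record carrying the unexplained river code "Y".
sampleText <- "E1,M,3,CRAFTS,HIGHWAY,?,2,N,THROUGH,WOOD,SHORT,S,WOOD
E2,A,25,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,SHORT,S,WOOD
E3,A,39,CRAFTS,AQUEDUCT,?,1,N,THROUGH,WOOD,?,S,WOOD
X1,Y,?,MATURE,HIGHWAY,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T"

# Read every column as character so codes such as "?" are kept verbatim.
bridgesSample <- read.csv(text = sampleText, header = FALSE,
                          colClasses = "character")

# Assumption: these are the river codes the report is able to explain.
documentedRivers <- c("A", "M", "O")

# Count each observed code, then list the ones with no explanation.
riverCounts <- table(bridgesSample$V2)
unexplained <- setdiff(names(riverCounts), documentedRivers)

print(riverCounts)
print(unexplained)
```

Applied to the full file, the same two calls would be expected to report the three ‘Y’ rows seen in the summary() output.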
We use the attribute names from the data dictionary as the new column names. The data dictionary provides a good description alongside each attribute name. Keeping the data frame column names consistent with the data dictionary will allow easy mapping in the future.
We verify the number of attributes in the data dictionary is consistent with the data file dimensions.
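One way to make this check explicit is to guard the renaming step with an assertion. The sketch below is illustrative only: `applyDictionaryNames` is a hypothetical helper, and the three-column `toy` frame stands in for the raw file.

```r
# Hypothetical helper: rename columns only when the dictionary supplies
# exactly one unique name per column of the data frame.
applyDictionaryNames <- function(df, dictNames) {
  stopifnot(length(dictNames) == ncol(df))  # one name per column
  stopifnot(!anyDuplicated(dictNames))      # names must be unique
  colnames(df) <- dictNames
  df
}

# Toy three-column frame standing in for the raw file (not the real data).
toy <- data.frame(V1 = c("E1", "E2"), V2 = c("M", "A"), V3 = c("3", "25"),
                  stringsAsFactors = FALSE)

toyNamed <- applyDictionaryNames(toy, c("IDENTIF", "RIVER", "LOCATION"))
print(colnames(toyNamed))

# A name vector of the wrong length is rejected before any renaming happens.
mismatch <- tryCatch(applyDictionaryNames(toy, c("IDENTIF", "RIVER")),
                     error = function(e) "rejected")
print(mismatch)
```

For the bridges file the equivalent check is that the 13 dictionary names match the 13 columns reported by str().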
rawColumns=c("IDENTIF", "RIVER", "LOCATION", "ERECTED", "PURPOSE", "LENGTH",
"LANES", "CLEAR-G", "T-OR-D", "MATERIAL", "SPAN", "REL-L","TYPE")
colnames(bridgesV2Raw) <- rawColumns
head(bridgesV2Raw)
## IDENTIF RIVER LOCATION ERECTED PURPOSE LENGTH LANES CLEAR-G T-OR-D
## 1 E1 M 3 CRAFTS HIGHWAY ? 2 N THROUGH
## 2 E2 A 25 CRAFTS HIGHWAY MEDIUM 2 N THROUGH
## 3 E3 A 39 CRAFTS AQUEDUCT ? 1 N THROUGH
## 4 E5 A 29 CRAFTS HIGHWAY MEDIUM 2 N THROUGH
## 5 E6 M 23 CRAFTS HIGHWAY ? 2 N THROUGH
## 6 E7 A 27 CRAFTS HIGHWAY SHORT 2 N THROUGH
## MATERIAL SPAN REL-L TYPE
## 1 WOOD SHORT S WOOD
## 2 WOOD SHORT S WOOD
## 3 WOOD ? S WOOD
## 4 WOOD SHORT S WOOD
## 5 WOOD ? S WOOD
## 6 WOOD MEDIUM S WOOD
str(bridgesV2Raw)
## 'data.frame': 108 obs. of 13 variables:
## $ IDENTIF : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
## $ RIVER : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
## $ LOCATION: Factor w/ 55 levels "?","1","10","11",..: 24 19 35 23 17 21 22 24 35 23 ...
## $ ERECTED : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PURPOSE : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
## $ LENGTH : Factor w/ 4 levels "?","LONG","MEDIUM",..: 1 3 1 3 1 4 3 3 1 3 ...
## $ LANES : Factor w/ 5 levels "?","1","2","4",..: 3 3 2 3 3 3 2 3 2 3 ...
## $ CLEAR-G : Factor w/ 3 levels "?","G","N": 3 3 3 3 3 3 3 3 3 3 ...
## $ T-OR-D : Factor w/ 3 levels "?","DECK","THROUGH": 3 3 3 3 3 3 3 3 2 3 ...
## $ MATERIAL: Factor w/ 4 levels "?","IRON","STEEL",..: 4 4 4 4 4 4 2 2 4 4 ...
## $ SPAN : Factor w/ 4 levels "?","LONG","MEDIUM",..: 4 4 1 4 1 3 4 4 1 3 ...
## $ REL-L : Factor w/ 4 levels "?","F","S","S-F": 3 3 3 3 3 3 3 3 3 3 ...
## $ TYPE : Factor w/ 8 levels "?","ARCH","CANTILEV",..: 8 8 8 8 8 8 7 7 8 8 ...
summary(bridgesV2Raw)
## IDENTIF RIVER LOCATION ERECTED PURPOSE LENGTH
## E1 : 1 A:49 28 : 5 CRAFTS :18 AQUEDUCT: 4 ? :27
## E10 : 1 M:41 39 : 5 EMERGING:15 HIGHWAY :71 LONG :21
## E100 : 1 O:15 25 : 4 MATURE :54 RR :32 MEDIUM:48
## E101 : 1 Y: 3 27 : 4 MODERN :21 WALK : 1 SHORT :12
## E102 : 1 29 : 4
## E103 : 1 1 : 3
## (Other):102 (Other):83
## LANES CLEAR-G T-OR-D MATERIAL SPAN REL-L TYPE
## ?:16 ?: 2 ? : 6 ? : 2 ? :16 ? : 5 SIMPLE-T:44
## 1: 4 G:80 DECK :15 IRON :11 LONG :30 F :58 WOOD :16
## 2:61 N:26 THROUGH:87 STEEL:79 MEDIUM:53 S :30 ARCH :13
## 4:23 WOOD :16 SHORT : 9 S-F:15 CANTILEV:11
## 6: 4 SUSPEN :11
## CONT-T :10
## (Other) : 3
The column with the most ambiguous values is LENGTH, where 27 of the 108 rows are recorded as “?”. In addition, three rows do not identify a known river (RIVER = Y). We therefore drop the rows with RIVER = Y and omit the LENGTH column from the data set delivered for analysis, so we expect a result with 105 observations of 12 variables.
## 'data.frame': 105 obs. of 12 variables:
## $ IDENTIF : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
## $ RIVER : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
## $ LOCATION: Factor w/ 55 levels "?","1","10","11",..: 24 19 35 23 17 21 22 24 35 23 ...
## $ ERECTED : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PURPOSE : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
## $ LANES : Factor w/ 5 levels "?","1","2","4",..: 3 3 2 3 3 3 2 3 2 3 ...
## $ CLEAR-G : Factor w/ 3 levels "?","G","N": 3 3 3 3 3 3 3 3 3 3 ...
## $ T-OR-D : Factor w/ 3 levels "?","DECK","THROUGH": 3 3 3 3 3 3 3 3 2 3 ...
## $ MATERIAL: Factor w/ 4 levels "?","IRON","STEEL",..: 4 4 4 4 4 4 2 2 4 4 ...
## $ SPAN : Factor w/ 4 levels "?","LONG","MEDIUM",..: 4 4 1 4 1 3 4 4 1 3 ...
## $ REL-L : Factor w/ 4 levels "?","F","S","S-F": 3 3 3 3 3 3 3 3 3 3 ...
## $ TYPE : Factor w/ 8 levels "?","ARCH","CANTILEV",..: 8 8 8 8 8 8 7 7 8 8 ...
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the final str() listing.
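Because the code that produced the reduced data set is not printed in this report, the following is a minimal sketch of one way to perform the two steps (drop rows with RIVER = Y, omit LENGTH). It is not necessarily what the hidden chunk does. `bridgesSample`, `keepRows`, `keepCols`, and `bridgesClean` are hypothetical names, and the final sample row is made up so that there is a ‘Y’ record to remove.

```r
# Column names taken from the data dictionary, as used earlier in the report.
rawColumns <- c("IDENTIF", "RIVER", "LOCATION", "ERECTED", "PURPOSE", "LENGTH",
                "LANES", "CLEAR-G", "T-OR-D", "MATERIAL", "SPAN", "REL-L", "TYPE")

# Illustrative sample only: rows E1-E3 copy records printed in this report;
# row X1 is a made-up record with RIVER = "Y".
sampleText <- "E1,M,3,CRAFTS,HIGHWAY,?,2,N,THROUGH,WOOD,SHORT,S,WOOD
E2,A,25,CRAFTS,HIGHWAY,MEDIUM,2,N,THROUGH,WOOD,SHORT,S,WOOD
E3,A,39,CRAFTS,AQUEDUCT,?,1,N,THROUGH,WOOD,?,S,WOOD
X1,Y,?,MATURE,HIGHWAY,?,2,G,THROUGH,STEEL,MEDIUM,F,SIMPLE-T"

bridgesSample <- read.csv(text = sampleText, header = FALSE,
                          colClasses = "character")
colnames(bridgesSample) <- rawColumns

# Step 1: keep only rows whose river code is not the unexplained "Y".
keepRows <- bridgesSample$RIVER != "Y"

# Step 2: keep every column except LENGTH.
keepCols <- setdiff(rawColumns, "LENGTH")

bridgesClean <- bridgesSample[keepRows, keepCols]
print(dim(bridgesClean))
```

On the full file the same two steps should leave 108 - 3 = 105 rows and 13 - 1 = 12 columns, matching the dimensions reported above. When the columns are factors, droplevels() could additionally remove the unused ‘Y’ level; the listing above still shows four RIVER levels, which suggests that step was not applied.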