Introduction

Your task is to choose one data set, then study the data and its associated description of the data (i.e. “data dictionary”). You should take the data, and create an R data frame with a subset of the columns (and if you like rows) in the data set. Your deliverable is the R code to perform these transformation tasks.

The data set I will examine and transform is the “Pittsburgh Bridges” data set from the archive at UC Irvine. Here is the data description or dictionary as given in the file “bridge.names.txt.

Data Description

  1. Attribute Information:

    The type field state whether a property is continuous/integer (c) or nominal (n).
    
    For properties with c,n type, the range of continuous numbers is given first
    and the possible values of the nominal follow the semi-colon.
Column Name Type Possible Values Comments
1 IDENTIF - - Identifier of the examples
2 RIVER n A, M, O
3 LOCATION n 1 to 52
4 ERECTED c,n 1818-1986 ; CRAFTS, EMERGING, MATURE, MODERN
5 PURPOSE n WALK, AQUEDUCT, RR, HIGHWAY
6 LENGTH c,n 804-4558 ; SHORT, MEDIUM, LONG
7 LANES c,n 1, 2, 4, 6 ; 1, 2, 4, 6
8 CLEAR-G n N, G
9 T-OR-D n THROUGH, DECK
10 MATERIAL n WOOD, IRON, STEEL
11 SPAN n SHORT, MEDUIM, LONG
12 REL-L n S, S-F, F
13 TYPE n WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, CONT-T

There are two versions of the data file as described below.

  1. Number of instances: 108

  2. Relevant Information:

    There are two versions to the database:
    
            V1 contains the original examples and
    
            V2 contains descriptions after discretizing numeric properties.

Data Examination

We will start by examining both data sets to see if we can discern the difference “discretizing numeric properties” makes to the data sets. We do this by reading in both versions of the data file and putting them into separate dataframes. We then run the Structure Function on each one and use View to browse the data.

Note on Example Code

I learned a few things loading data from GitHub. I found the “raw” button in GitHub and used it for the first time. This explained the word raw in the path of the example. I also found that if I used the example as written (replacing my path), I would lose an observation. I believe this is because the example assumes a header row, so I was losing the first row. This can be solved by adding “header = FALSE” to the read statement. I also added “na.strings =”?“” to my read and decided to write it all as one statement. A function like getURL() that returns a full path when given a website and a file name would be of great use for a data repository site. However, this function seems to require the full path, so only seems to save space in the read statement.

Load Data From GitHub Example

library(RCurl)

x <- getURL(“https://raw.github.com/aronlindberg/latent_growth_classes/master/LGC_data.csv”)

y <- read.csv(text = x)

library(RCurl)
## Loading required package: bitops
df1 <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/bridges.data.version1.txt", header = FALSE, na.strings = "?")

df2 <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/bridges.data.version2.txt", header = FALSE, na.strings = "?")

str(df1)
## 'data.frame':    108 obs. of  13 variables:
##  $ V1 : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
##  $ V2 : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
##  $ V3 : num  3 25 39 29 23 27 28 3 39 29 ...
##  $ V4 : int  1818 1819 1829 1837 1838 1840 1844 1846 1848 1851 ...
##  $ V5 : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
##  $ V6 : int  NA 1037 NA 1000 NA 990 1000 1500 NA 1000 ...
##  $ V7 : int  2 2 1 2 2 2 1 2 1 2 ...
##  $ V8 : Factor w/ 2 levels "G","N": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V9 : Factor w/ 2 levels "DECK","THROUGH": 2 2 2 2 2 2 2 2 1 2 ...
##  $ V10: Factor w/ 3 levels "IRON","STEEL",..: 3 3 3 3 3 3 1 1 3 3 ...
##  $ V11: Factor w/ 3 levels "LONG","MEDIUM",..: 3 3 NA 3 NA 2 3 3 NA 2 ...
##  $ V12: Factor w/ 3 levels "F","S","S-F": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V13: Factor w/ 7 levels "ARCH","CANTILEV",..: 7 7 7 7 7 7 6 6 7 7 ...
str(df2)
## 'data.frame':    108 obs. of  13 variables:
##  $ V1 : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
##  $ V2 : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
##  $ V3 : num  3 25 39 29 23 27 28 3 39 29 ...
##  $ V4 : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ V5 : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
##  $ V6 : Factor w/ 3 levels "LONG","MEDIUM",..: NA 2 NA 2 NA 3 2 2 NA 2 ...
##  $ V7 : int  2 2 1 2 2 2 1 2 1 2 ...
##  $ V8 : Factor w/ 2 levels "G","N": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V9 : Factor w/ 2 levels "DECK","THROUGH": 2 2 2 2 2 2 2 2 1 2 ...
##  $ V10: Factor w/ 3 levels "IRON","STEEL",..: 3 3 3 3 3 3 1 1 3 3 ...
##  $ V11: Factor w/ 3 levels "LONG","MEDIUM",..: 3 3 NA 3 NA 2 3 3 NA 2 ...
##  $ V12: Factor w/ 3 levels "F","S","S-F": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V13: Factor w/ 7 levels "ARCH","CANTILEV",..: 7 7 7 7 7 7 6 6 7 7 ...

Looking at the two files we see that both have 108 Observations and 13 Variables, so “discretizing numeric properties” did not change the number of fields. Looking at the results of the Structure() function we see some difference in the variables, which I will summarize in the following table.

Column Name Type df1 Type df2 Comments
1 IDENTIF - - Identifier of the examples, E1, E2 …
2 RIVER Factor Factor 4 levels A, M, O, Y
3 LOCATION num num 1 to 52
4 ERECTED int Factor df1 = 1818-1986; df2 = CRAFTS, EMERGING, MATURE, MODERN
5 PURPOSE Factor Factor WALK, AQUEDUCT, RR, HIGHWAY
6 LENGTH int Factor df1 = 804-4558; df2 = SHORT, MEDIUM, LONG
7 LANES int int 1, 2, 4, 6
8 CLEAR-G Factor Factor 2 levels N, G
9 T-OR-D Factor Factor 2 levels THROUGH, DECK
10 MATERIAL Factor Factor 3 levels WOOD, IRON, STEEL
11 SPAN Factor Factor 3 levels SHORT, MEDUIM, LONG
12 REL-L Factor Factor 3 levels S, S-F, F
13 TYPE Factor Factor 7 levels WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, CONT-T

Two fields are different between the two versions of the file. Column 4, Erected is a date stored as an integer value in dataframe 1 (df1) and Erected is a 4-level factor value in df2, seemingly a time period name (e.g. this bridge was erected in the Crafts era).

The other field is column 6, Length, which is an integer value in df1 (presumingly a measurement in feet) and a 3-level factor in df2, Short, Medium, Long. I feel like I may have lost something with the time-period factors for Erected and not so much with short. medium and long.

Before I list out what fields I want to keep and detail my data transformation I have one more mystery to solve. Why does column 13, Type, have 7 levels and only list 6 labels. I thought it might be NA, but that doesn’t seem to be it. Running the Summary() should help clear this up.

summary(df1, 10)
##        V1     V2           V3              V4              V5    
##  E1     : 1   A:49   Min.   : 1.00   Min.   :1818   AQUEDUCT: 4  
##  E10    : 1   M:41   1st Qu.:15.50   1st Qu.:1884   HIGHWAY :71  
##  E100   : 1   O:15   Median :27.00   Median :1903   RR      :32  
##  E101   : 1   Y: 3   Mean   :25.98   Mean   :1905   WALK    : 1  
##  E102   : 1          3rd Qu.:37.50   3rd Qu.:1928                
##  E103   : 1          Max.   :52.00   Max.   :1986                
##  E105   : 1          NA's   :1                                   
##  E107   : 1                                                      
##  E107N  : 1                                                      
##  (Other):99                                                      
##        V6             V7          V8           V9        V10    
##  Min.   : 804   Min.   :1.00   G   :80   DECK   :15   IRON :11  
##  1st Qu.:1000   1st Qu.:2.00   N   :26   THROUGH:87   STEEL:79  
##  Median :1300   Median :2.00   NA's: 2   NA's   : 6   WOOD :16  
##  Mean   :1567   Mean   :2.63                          NA's : 2  
##  3rd Qu.:2000   3rd Qu.:4.00                                    
##  Max.   :4558   Max.   :6.00                                    
##  NA's   :27     NA's   :16                                      
##                                                                 
##                                                                 
##                                                                 
##      V11       V12           V13    
##  LONG  :30   F   :58   ARCH    :13  
##  MEDIUM:53   S   :30   CANTILEV:11  
##  SHORT : 9   S-F :15   CONT-T  :10  
##  NA's  :16   NA's: 5   NIL     : 1  
##                        SIMPLE-T:44  
##                        SUSPEN  :11  
##                        WOOD    :16  
##                        NA's    : 2  
##                                     
## 

This shows that there is one NIL in the data for v13 that was not listed as a type in the data description. It is probably not a real type an should be counted as an NA.

I would like a data set that has the following fields. We will do this in the next section.

  • River, replace code with name

  • Erected - Date, from version 1, keep as number

  • Purpose, keep as factor

  • Lanes, make into a factor

  • Material, keep as factor

  • Erected - Period, from version 2, keep as factor

Data Transformation

We will start by taking a subset of df1 to start building our final dataframe. We will add the column from df2 for the Erected Time Period. We then need to replace the river names and change Lanes to a factor. We will run the Structure() and Summary functions on our new dataframe to wrap up.

To change the river names from the codes A, M, O to Allegheny, Monongahela and Ohio we need to have Allegheny, Monongahela and Ohio levels for the factor. There is a levels command that allows you to add levels, but I decided to covert the codes to character values, change the names and then convert back. I also discovered a Y code in River that is not addressed in the data description and does not refer to a Pittsburgh river, creek, or run I could find. It probably means NA in some way.

df <- subset(df1, select = c(V2, V4, V5, V7, V10))
df <- cbind(df, df2$V4)
fields <- c("River", "Date Erected", "Purpose", "Lanes", "Material", "Period Erected")
colnames(df) <- fields

river <- as.character(df$River)
river[river == "A"] <- "Allegheny"
river[river == "M"] <- "Monongahela"
river[river == "O"] <- "Ohio"
df$River <- as.factor(river)

df$Lanes <- as.factor(df$Lanes)

str(df)
## 'data.frame':    108 obs. of  6 variables:
##  $ River         : Factor w/ 4 levels "Allegheny","Monongahela",..: 2 1 1 1 2 1 1 2 1 1 ...
##  $ Date Erected  : int  1818 1819 1829 1837 1838 1840 1844 1846 1848 1851 ...
##  $ Purpose       : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
##  $ Lanes         : Factor w/ 4 levels "1","2","4","6": 2 2 1 2 2 2 1 2 1 2 ...
##  $ Material      : Factor w/ 3 levels "IRON","STEEL",..: 3 3 3 3 3 3 1 1 3 3 ...
##  $ Period Erected: Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(df)
##          River     Date Erected      Purpose    Lanes     Material 
##  Allegheny  :49   Min.   :1818   AQUEDUCT: 4   1   : 4   IRON :11  
##  Monongahela:41   1st Qu.:1884   HIGHWAY :71   2   :61   STEEL:79  
##  Ohio       :15   Median :1903   RR      :32   4   :23   WOOD :16  
##  Y          : 3   Mean   :1905   WALK    : 1   6   : 4   NA's : 2  
##                   3rd Qu.:1928                 NA's:16             
##                   Max.   :1986                                     
##   Period Erected
##  CRAFTS  :18    
##  EMERGING:15    
##  MATURE  :54    
##  MODERN  :21    
##                 
## 

This accomplished what I set out to do. I like that the summary function can easily tell us how many bridges of each lane type there are and that leaving Date Erected a number gives us useful information in minimum. maximum and median.