Your task is to choose one data set, then study the data and its associated description of the data (i.e. “data dictionary”). You should take the data, and create an R data frame with a subset of the columns (and if you like rows) in the data set. Your deliverable is the R code to perform these transformation tasks.
The data set I will examine and transform is the “Pittsburgh Bridges” data set from the archive at UC Irvine. Here is the data description or dictionary as given in the file “bridge.names.txt.
Attribute Information:
The type field state whether a property is continuous/integer (c) or nominal (n).
For properties with c,n type, the range of continuous numbers is given first
and the possible values of the nominal follow the semi-colon.| Column | Name | Type | Possible Values | Comments |
|---|---|---|---|---|
| 1 | IDENTIF | - | - | Identifier of the examples |
| 2 | RIVER | n | A, M, O | |
| 3 | LOCATION | n | 1 to 52 | |
| 4 | ERECTED | c,n | 1818-1986 ; CRAFTS, EMERGING, MATURE, MODERN | |
| 5 | PURPOSE | n | WALK, AQUEDUCT, RR, HIGHWAY | |
| 6 | LENGTH | c,n | 804-4558 ; SHORT, MEDIUM, LONG | |
| 7 | LANES | c,n | 1, 2, 4, 6 ; 1, 2, 4, 6 | |
| 8 | CLEAR-G | n | N, G | |
| 9 | T-OR-D | n | THROUGH, DECK | |
| 10 | MATERIAL | n | WOOD, IRON, STEEL | |
| 11 | SPAN | n | SHORT, MEDUIM, LONG | |
| 12 | REL-L | n | S, S-F, F | |
| 13 | TYPE | n | WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, CONT-T |
There are two versions of the data file as described below.
Number of instances: 108
Relevant Information:
There are two versions to the database:
V1 contains the original examples and
V2 contains descriptions after discretizing numeric properties.We will start by examining both data sets to see if we can discern the difference “discretizing numeric properties” makes to the data sets. We do this by reading in both versions of the data file and putting them into separate dataframes. We then run the Structure Function on each one and use View to browse the data.
I learned a few things loading data from GitHub. I found the “raw” button in GitHub and used it for the first time. This explained the word raw in the path of the example. I also found that if I used the example as written (replacing my path), I would lose an observation. I believe this is because the example assumes a header row, so I was losing the first row. This can be solved by adding “header = FALSE” to the read statement. I also added “na.strings =”?“” to my read and decided to write it all as one statement. A function like getURL() that returns a full path when given a website and a file name would be of great use for a data repository site. However, this function seems to require the full path, so only seems to save space in the read statement.
Load Data From GitHub Example
library(RCurl)
x <- getURL(“https://raw.github.com/aronlindberg/latent_growth_classes/master/LGC_data.csv”)
y <- read.csv(text = x)
library(RCurl)
## Loading required package: bitops
df1 <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/bridges.data.version1.txt", header = FALSE, na.strings = "?")
df2 <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/bridges.data.version2.txt", header = FALSE, na.strings = "?")
str(df1)
## 'data.frame': 108 obs. of 13 variables:
## $ V1 : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
## $ V2 : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
## $ V3 : num 3 25 39 29 23 27 28 3 39 29 ...
## $ V4 : int 1818 1819 1829 1837 1838 1840 1844 1846 1848 1851 ...
## $ V5 : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
## $ V6 : int NA 1037 NA 1000 NA 990 1000 1500 NA 1000 ...
## $ V7 : int 2 2 1 2 2 2 1 2 1 2 ...
## $ V8 : Factor w/ 2 levels "G","N": 2 2 2 2 2 2 2 2 2 2 ...
## $ V9 : Factor w/ 2 levels "DECK","THROUGH": 2 2 2 2 2 2 2 2 1 2 ...
## $ V10: Factor w/ 3 levels "IRON","STEEL",..: 3 3 3 3 3 3 1 1 3 3 ...
## $ V11: Factor w/ 3 levels "LONG","MEDIUM",..: 3 3 NA 3 NA 2 3 3 NA 2 ...
## $ V12: Factor w/ 3 levels "F","S","S-F": 2 2 2 2 2 2 2 2 2 2 ...
## $ V13: Factor w/ 7 levels "ARCH","CANTILEV",..: 7 7 7 7 7 7 6 6 7 7 ...
str(df2)
## 'data.frame': 108 obs. of 13 variables:
## $ V1 : Factor w/ 108 levels "E1","E10","E100",..: 1 21 32 54 65 76 87 98 2 12 ...
## $ V2 : Factor w/ 4 levels "A","M","O","Y": 2 1 1 1 2 1 1 2 1 1 ...
## $ V3 : num 3 25 39 29 23 27 28 3 39 29 ...
## $ V4 : Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ V5 : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
## $ V6 : Factor w/ 3 levels "LONG","MEDIUM",..: NA 2 NA 2 NA 3 2 2 NA 2 ...
## $ V7 : int 2 2 1 2 2 2 1 2 1 2 ...
## $ V8 : Factor w/ 2 levels "G","N": 2 2 2 2 2 2 2 2 2 2 ...
## $ V9 : Factor w/ 2 levels "DECK","THROUGH": 2 2 2 2 2 2 2 2 1 2 ...
## $ V10: Factor w/ 3 levels "IRON","STEEL",..: 3 3 3 3 3 3 1 1 3 3 ...
## $ V11: Factor w/ 3 levels "LONG","MEDIUM",..: 3 3 NA 3 NA 2 3 3 NA 2 ...
## $ V12: Factor w/ 3 levels "F","S","S-F": 2 2 2 2 2 2 2 2 2 2 ...
## $ V13: Factor w/ 7 levels "ARCH","CANTILEV",..: 7 7 7 7 7 7 6 6 7 7 ...
Looking at the two files we see that both have 108 Observations and 13 Variables, so “discretizing numeric properties” did not change the number of fields. Looking at the results of the Structure() function we see some difference in the variables, which I will summarize in the following table.
| Column | Name | Type df1 | Type df2 | Comments |
|---|---|---|---|---|
| 1 | IDENTIF | - | - | Identifier of the examples, E1, E2 … |
| 2 | RIVER | Factor | Factor | 4 levels A, M, O, Y |
| 3 | LOCATION | num | num | 1 to 52 |
| 4 | ERECTED | int | Factor | df1 = 1818-1986; df2 = CRAFTS, EMERGING, MATURE, MODERN |
| 5 | PURPOSE | Factor | Factor | WALK, AQUEDUCT, RR, HIGHWAY |
| 6 | LENGTH | int | Factor | df1 = 804-4558; df2 = SHORT, MEDIUM, LONG |
| 7 | LANES | int | int | 1, 2, 4, 6 |
| 8 | CLEAR-G | Factor | Factor | 2 levels N, G |
| 9 | T-OR-D | Factor | Factor | 2 levels THROUGH, DECK |
| 10 | MATERIAL | Factor | Factor | 3 levels WOOD, IRON, STEEL |
| 11 | SPAN | Factor | Factor | 3 levels SHORT, MEDUIM, LONG |
| 12 | REL-L | Factor | Factor | 3 levels S, S-F, F |
| 13 | TYPE | Factor | Factor | 7 levels WOOD, SUSPEN, SIMPLE-T, ARCH, CANTILEV, CONT-T |
Two fields are different between the two versions of the file. Column 4, Erected is a date stored as an integer value in dataframe 1 (df1) and Erected is a 4-level factor value in df2, seemingly a time period name (e.g. this bridge was erected in the Crafts era).
The other field is column 6, Length, which is an integer value in df1 (presumingly a measurement in feet) and a 3-level factor in df2, Short, Medium, Long. I feel like I may have lost something with the time-period factors for Erected and not so much with short. medium and long.
Before I list out what fields I want to keep and detail my data transformation I have one more mystery to solve. Why does column 13, Type, have 7 levels and only list 6 labels. I thought it might be NA, but that doesn’t seem to be it. Running the Summary() should help clear this up.
summary(df1, 10)
## V1 V2 V3 V4 V5
## E1 : 1 A:49 Min. : 1.00 Min. :1818 AQUEDUCT: 4
## E10 : 1 M:41 1st Qu.:15.50 1st Qu.:1884 HIGHWAY :71
## E100 : 1 O:15 Median :27.00 Median :1903 RR :32
## E101 : 1 Y: 3 Mean :25.98 Mean :1905 WALK : 1
## E102 : 1 3rd Qu.:37.50 3rd Qu.:1928
## E103 : 1 Max. :52.00 Max. :1986
## E105 : 1 NA's :1
## E107 : 1
## E107N : 1
## (Other):99
## V6 V7 V8 V9 V10
## Min. : 804 Min. :1.00 G :80 DECK :15 IRON :11
## 1st Qu.:1000 1st Qu.:2.00 N :26 THROUGH:87 STEEL:79
## Median :1300 Median :2.00 NA's: 2 NA's : 6 WOOD :16
## Mean :1567 Mean :2.63 NA's : 2
## 3rd Qu.:2000 3rd Qu.:4.00
## Max. :4558 Max. :6.00
## NA's :27 NA's :16
##
##
##
## V11 V12 V13
## LONG :30 F :58 ARCH :13
## MEDIUM:53 S :30 CANTILEV:11
## SHORT : 9 S-F :15 CONT-T :10
## NA's :16 NA's: 5 NIL : 1
## SIMPLE-T:44
## SUSPEN :11
## WOOD :16
## NA's : 2
##
##
This shows that there is one NIL in the data for v13 that was not listed as a type in the data description. It is probably not a real type an should be counted as an NA.
I would like a data set that has the following fields. We will do this in the next section.
River, replace code with name
Erected - Date, from version 1, keep as number
Purpose, keep as factor
Lanes, make into a factor
Material, keep as factor
Erected - Period, from version 2, keep as factor
We will start by taking a subset of df1 to start building our final dataframe. We will add the column from df2 for the Erected Time Period. We then need to replace the river names and change Lanes to a factor. We will run the Structure() and Summary functions on our new dataframe to wrap up.
To change the river names from the codes A, M, O to Allegheny, Monongahela and Ohio we need to have Allegheny, Monongahela and Ohio levels for the factor. There is a levels command that allows you to add levels, but I decided to covert the codes to character values, change the names and then convert back. I also discovered a Y code in River that is not addressed in the data description and does not refer to a Pittsburgh river, creek, or run I could find. It probably means NA in some way.
df <- subset(df1, select = c(V2, V4, V5, V7, V10))
df <- cbind(df, df2$V4)
fields <- c("River", "Date Erected", "Purpose", "Lanes", "Material", "Period Erected")
colnames(df) <- fields
river <- as.character(df$River)
river[river == "A"] <- "Allegheny"
river[river == "M"] <- "Monongahela"
river[river == "O"] <- "Ohio"
df$River <- as.factor(river)
df$Lanes <- as.factor(df$Lanes)
str(df)
## 'data.frame': 108 obs. of 6 variables:
## $ River : Factor w/ 4 levels "Allegheny","Monongahela",..: 2 1 1 1 2 1 1 2 1 1 ...
## $ Date Erected : int 1818 1819 1829 1837 1838 1840 1844 1846 1848 1851 ...
## $ Purpose : Factor w/ 4 levels "AQUEDUCT","HIGHWAY",..: 2 2 1 2 2 2 1 2 1 2 ...
## $ Lanes : Factor w/ 4 levels "1","2","4","6": 2 2 1 2 2 2 1 2 1 2 ...
## $ Material : Factor w/ 3 levels "IRON","STEEL",..: 3 3 3 3 3 3 1 1 3 3 ...
## $ Period Erected: Factor w/ 4 levels "CRAFTS","EMERGING",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(df)
## River Date Erected Purpose Lanes Material
## Allegheny :49 Min. :1818 AQUEDUCT: 4 1 : 4 IRON :11
## Monongahela:41 1st Qu.:1884 HIGHWAY :71 2 :61 STEEL:79
## Ohio :15 Median :1903 RR :32 4 :23 WOOD :16
## Y : 3 Mean :1905 WALK : 1 6 : 4 NA's : 2
## 3rd Qu.:1928 NA's:16
## Max. :1986
## Period Erected
## CRAFTS :18
## EMERGING:15
## MATURE :54
## MODERN :21
##
##
This accomplished what I set out to do. I like that the summary function can easily tell us how many bridges of each lane type there are and that leaving Date Erected a number gives us useful information in minimum. maximum and median.