Creating clean data for type 2 Superconductor data analysis

Introduction

This is an R Markdown document that takes raw superconductor data cut and pasted from a web site into two files (one for the chemical formula and one for the critical temperature), and processes it into a cleaned data frame.

The clean data frame is then written to a file, from which another R Markdown document can read it to carry out the data analysis.

Reading the raw data

The first step is to download the two text files containing the chemical formulae and the critical temperatures respectively. They are read into R in the sections that follow.

fileURL<-"http://roypmurphy.co.uk/wp-content/uploads/2014/11/rawFormulae.txt"
download.file(fileURL,destfile="rawFormulae.txt")
fileURL<-"http://roypmurphy.co.uk/wp-content/uploads/2014/11/rawTemperatures.txt"
download.file(fileURL,destfile="rawTemperatures.txt")

Cleaning up the temperature data

The temperature data cut and pasted from the website is character data. Embedded in each line are a character sequence representing the numerical value of the critical temperature, a character representing the unit of measurement (Celsius or Kelvin), and other characters.

In some cases a temperature range rather than a single value is given. This clean up will take that range and calculate a single value as the mid-point of the range, which will be used as the critical temperature value in the data frame created for the cleaned data.

Where a tilde (~) is used to indicate an approximate value, it will be removed.
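To make the range and tilde handling concrete, here is a minimal sketch on made-up input strings. The helper name `range_midpoint` and the sample values are illustrative assumptions, not taken from the raw data, and the sketch assumes non-negative values so that a hyphen always marks a range.

```r
# Hypothetical helper: turn "a-b" into the mid-point (a + b) / 2
# and pass single values through unchanged.
range_midpoint <- function(x) {
  x <- gsub("~", "", x, fixed = TRUE)        # drop the approximation marker
  parts <- strsplit(x, "-", fixed = TRUE)    # split a range on the hyphen
  # mean() of one number is the number itself; of two, the mid-point
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

# Made-up examples, not values from the raw file
range_midpoint(c("~92", "35-38", "110"))
```

With these inputs the range "35-38" becomes 36.5, while the other two values are returned as 92 and 110.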

First we read in the raw text containing the critical temperatures:

xRaw<-readLines("rawTemperatures.txt")

Next we remove blank lines:

xProc1<-xRaw[xRaw!=""]

Now we have a vector of lines containing the numerical values and units of interest. We process those lines to remove the extraneous text and generate one vector of numerical temperature values and a second vector holding the unit of measurement for each temperature.

#remove text after the temperature unit indicator character (C or K) for most cases
xProc1<-gsub('(.*[CK]).*','\\1',xProc1)

#remove Tc text from all observations
xProc1<-gsub('Tc','',xProc1)

#remove tilde
xProc1<-gsub('~','',xProc1)

#remove + signs
xProc1<-gsub('\\+','',xProc1)

#remove any leftover text from the last opening parenthesis onward
xProc1<-gsub('(.*)\\(.*','\\1',xProc1)

#extract the unit of temperature into a vector
TempUnit<-gsub('.*([CK]).*','\\1',xProc1)

#extract the numeric portion of the data into a vector
#(strips K and C, plus the u of the micro prefix in "uK")
TempValChar<-gsub('[KCu]','',xProc1)

#build a numeric vector of values from the character values
#convert Celsius values to Kelvin where necessary
noVals<-length(TempValChar)
TempValNum<-rep(NA_real_, noVals)
for (i in seq_len(noVals)) {
    if (grepl("-",TempValChar[i],fixed=TRUE))  {
       #a range "a-b": rewrite it as "a+b" and halve the sum to get the mid-point
       Temp<-gsub("-","+",TempValChar[i],fixed=TRUE)
       theVal<-eval(parse(text=Temp))/2
       }
    else {theVal<-eval(parse(text=TempValChar[i]))}
    #values given in microkelvin ("uK") are scaled to kelvin
    if (grepl("u",xProc1[i],fixed=TRUE)) {theVal<-theVal*10^-6}
    if (TempUnit[i]=="C") {
        theVal<-theVal + 273.15
        TempUnit[i]<-"K"
        }
    TempValNum[i]<-signif(theVal,3)
}
df <- data.frame(TempValNum,TempUnit)
names(df)<-c("CriticalTemp","UnitOfTemp")
nrow(df)

## [1] 144

write.table(df,"CriticalTemp.txt")
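The conversion loop in this section uses eval(parse()) to do its arithmetic. As an alternative, the following is a rough sketch of the same steps in vectorised form. It assumes each cleaned string holds either a single non-negative number or a hyphen-separated range; the function name `to_kelvin`, its arguments, and the sample inputs are hypothetical.

```r
# Hypothetical vectorised version of the conversion loop.
# value: character vector such as "92" or "35-38"
# unit:  "K" or "C" for each value
# micro: TRUE where the value was given in microkelvin
to_kelvin <- function(value, unit, micro = rep(FALSE, length(value))) {
  parts <- strsplit(value, "-", fixed = TRUE)
  # mean() of one number is the number itself; of two, the mid-point
  num <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
  num <- ifelse(micro, num * 1e-6, num)          # microkelvin -> kelvin
  num <- ifelse(unit == "C", num + 273.15, num)  # Celsius -> kelvin
  signif(num, 3)                                 # three significant figures
}

# Made-up inputs for illustration only
to_kelvin(c("92", "35-38", "20", "325"),
          unit  = c("K", "K", "C", "K"),
          micro = c(FALSE, FALSE, FALSE, TRUE))
```

The order of operations mirrors the loop: mid-point first, then the micro scaling, then the Celsius offset, then rounding.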

Cleaning up the Chemical Formulae Data

First we read in the raw text containing the chemical formulae:

xRaw<-readLines("rawFormulae.txt")

Next we remove blank lines:

xProc1<-xRaw[xRaw!=""]

Next we remove lines containing the text “As a”, which appears underneath some of the chemical formulae in the sentence “As a xxxx structure”:

xProc1<-xProc1[!grepl("As a",xProc1)]

Next we remove the remaining extraneous text after each formula:

xProc1<-gsub('([A-Za-z]+) .*','\\1',xProc1)
#remove + signs
xProc1<-gsub('\\+','',xProc1)
#remove * signs
xProc1<-gsub('\\*','',xProc1)

Now we have a vector of text lines where each line contains a chemical formula. As a quick check, the length of the vector of formulae at this point equals the number of rows in the data frame of critical temperatures, so we can be fairly sure that each formula has a corresponding critical temperature in that data frame.
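The quick check described here can also be written as an assertion, so that a mismatch stops the document from knitting. This is a minimal sketch with stand-in objects: the names `formulae` and `temps` and their contents are illustrative samples, standing in for the cleaned formula vector and the critical temperature data frame.

```r
# Stand-in objects so the snippet runs on its own; in the document these
# would be the cleaned formula vector and the temperature data frame.
formulae <- c("YBa2Cu3O7", "MgB2", "Nb3Sn")            # illustrative sample
temps    <- data.frame(CriticalTemp = c(92, 39, 18))   # illustrative sample

# Stop with an error if the two sources have drifted out of step
stopifnot(length(formulae) == nrow(temps))
```

A matching count does not prove that each formula lines up with its own temperature, so spot-checking a few rows is still worthwhile.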

I have used the web site as a guide to interpreting the formulae. Some of the formulae on the type 2 superconductor page have symbols in parentheses, the letters x and y used as subscripts, and fractional subscripts for the quantities of elements in a substance.

#load the stringr package 
library(stringr)

#break up the formulae into pieces of text that each begin with a capital letter
Components_1<-gsub('([[:upper:]])', ' \\1', xProc1)

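#extract only the text that represents an element
Components_2<-gsub('[^A-Za-z]', ' ', Components_1)
#drop the x and y subscripts (this also strips a lower-case x or y from
#two-letter symbols such as Dy)
Components_2<-gsub('[xy]', ' ', Components_2)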
Components_2<-gsub('\\s+', ' ', Components_2)
Components_2<-str_trim(Components_2)

#split each formula into a list of elements
element_lists<-strsplit(Components_2,' ')
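As a worked illustration of the splitting steps, the sketch below traces a single formula through them. The formula is made up for this example, the intermediate names `step1` to `step4` are hypothetical, and base R's `trimws` stands in for `str_trim` so that the snippet needs no packages. It also assumes the explicit letter class `[A-Za-z]`.

```r
# Illustrative formula with a parenthesised group, an x subscript
# and fractional subscripts (made up for this example)
f <- "(Sn1.0Pb0.5)Ba2Cu3Ox"

step1 <- gsub("([[:upper:]])", " \\1", f)    # put a space before each capital
step2 <- gsub("[^A-Za-z]", " ", step1)       # keep letters only
step3 <- gsub("[xy]", " ", step2)            # drop x / y subscripts
step4 <- trimws(gsub("\\s+", " ", step3))    # collapse and trim spaces

# Split the cleaned string into its element symbols
strsplit(step4, " ")[[1]]
```

For this input the result is the five symbols Sn, Pb, Ba, Cu and O.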


Roy Murphy

Saturday, November 01, 2014
