1 Introduction

1.1 Basic Information and Purpose of Project

Crystallography is the branch of science devoted to the study of molecular and crystalline structure and properties, with far-reaching applications in mineralogy, chemistry, physics, mathematics, biology, metallurgy and materials science.

The Cambridge Crystallographic Data Centre (CCDC) are world-leading experts in structural chemistry data, software, and knowledge for materials and life science research and application. The CCDC compiles and distributes the Cambridge Structural Database (CSD), a certified trusted database of fully curated and enhanced organic and metal-organic structures, used by researchers across the globe.

Our purpose was to create code that enables faster analysis of large amounts of data from the CSD.

1.2 Description of Parameters

Several structure parameters can be taken from the CSD and exported to a .csv file. The parameters used in this analysis are briefly described below. More information is available through the external links in some of the definitions.

Refcode - the unique code of a structure in the CSD.
R - the final R factor, one measure of model quality (given in percent).
Space Group - the symmetry group of a three-dimensional crystal pattern.
Space Group Number - the unique number of the space group (1 to 230), ordered from the lowest symmetry to the highest.
No.of.Coordinates - the number of parameters used in crystal refinement.
Z - the number of formula units in the unit cell.
T - the temperature of the experiment in Kelvin.
Dens - density values calculated from the crystal cell and contents. The units are megagrams per cubic meter (grams per cubic centimeter).
a, b, c, Alpha, Beta, Gamma - the unit cell parameters: edge lengths in angstroms and angles in degrees.
V - cell volume in cubic angstroms.
Pub. Year - the year of structure publication.
Unique Chemical Units - the number of unique molecules or ions etc. which are present in the crystal structure.
Crystal Family - the smallest set of space groups containing, for any of its members, all space groups of the Bravais class and all space groups of the geometric crystal class to which this member belongs.
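
Two of the parameters above are derived quantities, so they can be cross-checked against the others. The sketch below uses the general (triclinic) cell volume formula and the usual relation Dens = Z * M / (N_A * V). The function names and the molar mass argument `M` are illustrative assumptions: molar mass is not one of the parameters listed above, and the numbers in the example calls are made up.

```r
# Cell volume in cubic angstroms from edge lengths (angstroms) and angles (degrees).
cell_volume <- function(a, b, c, alpha, beta, gamma) {
  ca <- cos(alpha * pi / 180)
  cb <- cos(beta * pi / 180)
  cg <- cos(gamma * pi / 180)
  a * b * c * sqrt(1 - ca^2 - cb^2 - cg^2 + 2 * ca * cb * cg)
}

# Calculated density in grams per cubic centimeter.
# M is the molar mass of one formula unit in g/mol (assumed input, not an exported column).
calc_density <- function(Z, M, V) {
  avogadro   <- 6.02214076e23  # particles per mole
  cm3_per_A3 <- 1e-24          # one cubic angstrom expressed in cubic centimeters
  Z * M / (avogadro * V * cm3_per_A3)
}

# Hypothetical example values
cell_volume(10, 10, 10, 90, 90, 90)
calc_density(Z = 4, M = 180.16, V = 764)
```

Comparing the recomputed volume with the exported V column is one possible sanity check for a data set of this kind.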

1.3 Information about data set

The CSD contains over one million structures in total, so we restricted the data used in this analysis to compounds with a hydroxyl (-OH) group (only 9723 structures).

Other restrictions were:

the hydroxyl group had to carry zero charge (0ve)
oxygen had to be bonded only to carbon and hydrogen (T2)
carbon had to be bonded to another carbon or to hydrogen (R)
all atoms marked (a) had to be acyclic.

However, it is still possible to adjust our code for other specific searches!

2 Preparation

2.1 Read .csv file

The analysis starts by reading a data set generated by the CCDC ConQuest software.

structures <- read.csv2("alkohole.csv")
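
`read.csv2()` assumes semicolon-separated fields and a comma as the decimal mark, so it only works if the export uses that convention. The sketch below illustrates this on a small inline table. The column names and values are made up for the example and are not taken from `alkohole.csv`.

```r
# Hypothetical two-row export with ';' as the separator and ',' as the decimal mark.
csv_text <- "Refcode;R;Temp\nAAAAAA;4,52;293\nBBBBBB;10,1;100"

demo <- read.csv2(text = csv_text, stringsAsFactors = FALSE)

# R is parsed as numeric because ',' is interpreted as the decimal mark.
str(demo)
```

If a file uses commas as separators and points as decimal marks, `read.csv()` would be the matching function.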

2.2 Packages

Check for, install, and load the required packages.

requiredPackages = c("ggplot2","ggrepel", "dplyr","tidyr", "plotly" ) #list of required packages
for(i in requiredPackages) {if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages) {library(i,character.only = TRUE)}
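
An alternative is to find out which packages are missing before deciding whether to install anything. This is a minimal sketch, assuming it is acceptable to query the library with `requireNamespace()`; the helper name `missing_packages` is hypothetical.

```r
# Return the subset of 'pkgs' that is not installed, without attaching any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

requiredPackages <- c("ggplot2", "ggrepel", "dplyr", "tidyr", "plotly")
to_install <- missing_packages(requiredPackages)
to_install  # packages that would still need install.packages()
```

Separating the check from the installation makes it easier to review what will be installed on a shared machine.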

3 Data set analysis

3.1 Adding Crystal Family

Information about the crystal family cannot be obtained directly from the CCDC ConQuest software, so we created an additional parameter named “Crystal Family”, derived from the Space Group Number of each structure.

structures<- structures %>% 
  mutate(Crystal.Family = case_when(between(Space.Gp..Number..I., 1 ,2) ~ "triclinic",
                                     between(Space.Gp..Number..I., 3 ,15) ~ "monoclinic",
                                     between(Space.Gp..Number..I., 16 ,74) ~ "orthorhombic",
                                     between(Space.Gp..Number..I., 75 ,142) ~ "tetragonal",
                                     between(Space.Gp..Number..I., 143 ,194) ~ "hexagonal",
                                     between(Space.Gp..Number..I., 195 ,230) ~ "cubic"))
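
The same mapping can also be written with base R's `cut()`, which returns a factor whose levels follow the symmetry order and is therefore convenient for plotting. This is a sketch of an equivalent lookup, not the code used for the report; the function name `crystal_family` is illustrative.

```r
# Map space group numbers (1 to 230) to crystal families using interval breaks.
# Intervals are right-closed: (0,2], (2,15], (15,74], (74,142], (142,194], (194,230].
crystal_family <- function(space_group_number) {
  cut(space_group_number,
      breaks = c(0, 2, 15, 74, 142, 194, 230),
      labels = c("triclinic", "monoclinic", "orthorhombic",
                 "tetragonal", "hexagonal", "cubic"))
}

crystal_family(c(2, 14, 19, 88, 167, 225))
```

Numbers outside 1 to 230 fall outside every interval and come back as `NA`, which makes invalid entries easy to spot.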

3.2 Rename of parameters from raw data set

The default column names were not convenient for further visualisation, so we replaced them with clearer names.

structures <- rename(structures, R = R.factor..R. ,
                   Space.Group = Space.Gp..Symbol..L.,
                   Space.Group.Number = Space.Gp..Number..I.,
                   No.of.Coordinates = No..of.Coordinates..I.,
                   Z = Z.Value..R., 
                   Z.prime = Z.Prime..R.,
                   T = Study.Temp...I.,
                   Dens = Calc..Density..R.,
                   a = a..R.,
                   b = b..R.,
                   c = c..R.,
                   Alpha = Alpha..R.,
                   Beta = Beta..R.,
                   Gamma = Gamma..R.,
                   V = Cell.Volume..R.,
                   Pub.Year = Publication.Year..I.,
                   Unique.Chem.Units = Unique.Chemical.Units..I.,
                   Name = Compound.Name..L.)

3.3 Search for higher metric symmetry

Large data sets like this one may contain problematic entries. Here we checked whether every structure in our set has a well-determined space group, or whether some structures could have been solved in a higher symmetry.

Each crystal family places strict constraints on the unit cell parameters. We checked whether the unit cell parameters from the CSD match the corresponding crystal family or also fit a more symmetric one.

We found ten structures whose unit cell parameters are also compatible with a space group of higher symmetry. This may be a coincidence or a mistake, so the structures listed below should be analysed in more depth.

structures <- structures %>% 
  mutate(Compatibility  = case_when(
    # triclinic cell with two right angles: monoclinic metric
    Crystal.Family=="triclinic" & 
      ((round(Alpha,2)==90 & round(Beta,2)==90) | 
       (round(Alpha,2)==90 & round(Gamma,2)==90) | 
       (round(Gamma,2)==90 & round(Beta,2)==90)) ~ "NO",
    # monoclinic cell with three right angles: orthorhombic metric
    Crystal.Family=="monoclinic" & 
      (round(Alpha,2)==90 & round(Beta,2)==90 & round(Gamma,2)==90) ~ "NO",
    # orthorhombic cell with two equal edges: tetragonal metric
    Crystal.Family=="orthorhombic" & 
      (round(a,2)==round(b,2) | 
       round(a,2)==round(c,2) | 
       round(b,2)==round(c,2)) ~ "NO",
    # tetragonal cell with three equal edges: cubic metric
    Crystal.Family=="tetragonal" & 
      (round(a,2)==round(b,2) & round(b,2)==round(c,2)) ~ "NO"))

structures %>%  filter(Compatibility=="NO") %>%  select(Refcode, R, Space.Group, a, b, c, Alpha, Beta, Gamma, Pub.Year)
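
The idea behind this check can be sketched in base R with an explicit tolerance instead of rounding. The function below flags a cell whose metric is compatible with a more symmetric crystal family. It assumes that the orthorhombic and tetragonal tests compare edge lengths with each other; the names `near` and `higher_metric_symmetry`, the tolerance value, and the toy data are all made up for illustration.

```r
# TRUE where x and y differ by at most 'tol'.
near <- function(x, y, tol = 0.01) abs(x - y) <= tol

# Flag cells whose metric also satisfies the constraints of a more symmetric family.
higher_metric_symmetry <- function(family, a, b, c, alpha, beta, gamma, tol = 0.01) {
  right_angles <- near(alpha, 90, tol) + near(beta, 90, tol) + near(gamma, 90, tol)
  equal_edges  <- near(a, b, tol) | near(a, c, tol) | near(b, c, tol)
  all_equal    <- near(a, b, tol) & near(b, c, tol)
  (family == "triclinic"    & right_angles >= 2) |
    (family == "monoclinic"   & right_angles == 3) |
    (family == "orthorhombic" & equal_edges) |
    (family == "tetragonal"   & all_equal)
}

# Hypothetical cells, one per family
cells <- data.frame(
  family = c("triclinic", "monoclinic", "orthorhombic", "tetragonal"),
  a      = c(5.1, 6.0, 7.2, 8.0),
  b      = c(6.3, 7.5, 7.2, 8.0),
  c      = c(7.8, 9.1, 11.4, 12.5),
  alpha  = c(90, 90, 90, 90),
  beta   = c(90, 104.2, 90, 90),
  gamma  = c(101.5, 90, 90, 90)
)
cells$flag <- with(cells, higher_metric_symmetry(family, a, b, c, alpha, beta, gamma))
cells
```

A flagged cell is only a candidate for closer inspection, since a matching metric alone does not prove that the true symmetry is higher.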

4 Data Visualisation

4.1 Number of structures by year

This bar chart shows the number of structures published in the CSD each year. After 2005, noticeably more structures with an -OH group were published. This is probably due to wider access to equipment and/or a shift of research towards compounds with a hydroxyl group. The number of structures from 2019 is still increasing because research is ongoing.

structures3 <- structures %>% 
  group_by(Pub.Year) %>% 
  summarise(number = n())

plot_ly(
  x = structures3$Pub.Year,
  y = structures3$number,
  name = "Number of structures by year",
  type = "bar",
  marker = list(color = 'rgb(158,202,225)',
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5))) %>%
  layout(title = "Number of structures by year", 
         yaxis = list(title = 'Count'), 
         xaxis = list(title = 'Year'))

4.2 Number of structures by crystal family

The bar chart shows that compounds with an -OH group often crystallise in the monoclinic system. This matches our expectations, because organic compounds usually crystallise in space groups of low symmetry. For the same reason, no structures from the cubic system appear in this chart.

structures4 <- structures %>% 
  group_by(Crystal.Family) %>% 
  summarise(number = n())

target <- c('triclinic', 'monoclinic', 'orthorhombic', 'tetragonal', 'hexagonal', 'cubic')

structures5 <- structures4[match(target, structures4$Crystal.Family),]

plot_ly(data = structures5,
  x = ~Crystal.Family,
  y = ~number,
  name = "Crystal.Family",
  type = "bar",
  marker = list(color = 'rgb(158,202,225)',
                      line = list(color = 'rgb(8,48,107)',
                                  width = 1.5))) %>%
  layout(title = "Number of structures by crystal family", 
         yaxis = list(title = 'Count'), 
         xaxis = list(title = 'Crystal Family',
                      categoryorder = "array",
                      categoryarray = ~Crystal.Family))
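
Converting the crystal family to a factor with explicit levels is another way to fix the bar order, and it keeps families without any structures as zero counts. A minimal base R sketch with made-up data:

```r
target <- c("triclinic", "monoclinic", "orthorhombic", "tetragonal", "hexagonal", "cubic")

# Hypothetical families for a handful of structures (no cubic entries).
families <- c("monoclinic", "monoclinic", "triclinic",
              "orthorhombic", "monoclinic", "hexagonal")

# table() on a factor reports every level, including those with zero entries.
counts <- table(factor(families, levels = target))
counts_df <- as.data.frame(counts, responseName = "number")
names(counts_df)[1] <- "Crystal.Family"
counts_df
```

The resulting data frame has one row per crystal family in the order given by `target`.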

4.3 R factor vs Publication Year

In this chart we can see that until 1965 the R factor values were above 10%, because equipment and refinement software were not as advanced as they are today. After 1965, the number of structures with an R factor around 5% slowly increases. After 2000, the R factor of some structures is very low (around 3% or less). Values below 7% are normally expected. The IUCr recommends that higher values be accompanied by a suitable explanation in the publication.

Structures with an R factor above 25%, which may be of particular interest for deeper crystallographic analysis, are labelled in the chart and listed in the accompanying table.

structures2 <- structures[structures$R!=0,]

ggplot()+
  geom_point(data = structures2, aes(y = R, x = Pub.Year),
             color = "black", 
             fill = "blue", 
             shape = 21 , 
             alpha = 0.2, 
             size = 2) + 
  geom_label_repel(data =  structures2[structures2$R >25.0,],  
            aes(x = Pub.Year, y=R, label = Refcode),
            colour = 'black',
            fill = "lightgoldenrod1",
            size = 4, 
            hjust = 0, 
            vjust = 0) +
  labs(title = 'R factor vs Publication Year') +
  theme(plot.title = element_text(face = "bold", size = 15),
        axis.title = element_text(face = "italic", size = 13)) +
  scale_x_continuous(breaks = seq(1900, 2020, by = 10)) +
  scale_y_continuous(breaks = seq(0, 50, by = 5))

structures %>%  filter(R > 25) %>%  select(Refcode, R, Space.Group, No.of.Coordinates, T, Unique.Chem.Units , Pub.Year)
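
The trend over time can also be summarised numerically, for example as the median R factor per decade. The sketch below uses a small made-up data frame with the column names `R` and `Pub.Year` from the renamed data set; the function name `median_r_by_decade` is illustrative.

```r
# Median R factor for each decade of publication.
median_r_by_decade <- function(data) {
  decade <- floor(data$Pub.Year / 10) * 10
  tapply(data$R, decade, median, na.rm = TRUE)
}

# Hypothetical values, not taken from the CSD
toy <- data.frame(Pub.Year = c(1961, 1964, 1987, 1989, 2004, 2008, 2015),
                  R        = c(12.4, 14.0, 6.1, 5.3, 4.8, 3.9, 3.1))
median_r_by_decade(toy)
```

The median is used here because a few structures with very high R factors would distort a mean.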

4.4 R factor vs Volume

For small unit cell volumes there is no visible relation between volume and R factor, but as the unit cell volume increases we can see a small increase in R factor. For very large unit cells (more than 25000 cubic angstroms), the R factor of most structures is higher than 10%. This shows that such structures are hard to refine to a good R factor.

Structures with V above 25000 cubic angstroms are listed below. These compounds usually have a high number of coordinates and a larger number of unique chemical units. Several temperatures are listed as 0 K, which is an obvious mistake that should be corrected by the publication authors.

ggplot()+
  geom_point(data = structures2, aes(y = R, x = V),
             color = "black", 
             fill = "blue", 
             shape = 21 , 
             alpha = 0.2, 
             size = 2) + 
  geom_label_repel(data =  structures2[structures2$V >25000.0,],  
                   aes(x = V, y=R, label = Refcode),
                   colour = 'black',
                   fill = "lightgoldenrod1",
                   size = 4, 
                   hjust = 0, 
                   vjust = 0) +
  labs(title = 'R factor vs Volume') +
  theme(plot.title = element_text(face = "bold", size = 15),
        axis.title = element_text(face = "italic", size = 13)) +
  scale_y_continuous(breaks = seq(0, 50, by = 5))

structures %>%  filter(V >25000.0) %>%  select(Refcode, R, Space.Group, No.of.Coordinates, T, V, Unique.Chem.Units , Pub.Year)
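
One way to put a number on this observation is to compare the share of structures with R above 10% below and above the volume cutoff. This is a sketch on made-up values, assuming the column names `R` and `V`; `share_high_r` is a hypothetical helper.

```r
# Share of structures with R above 'r_cut', split by whether V exceeds 'v_cut'.
share_high_r <- function(data, v_cut = 25000, r_cut = 10) {
  large <- data$V > v_cut
  c(small_cells = mean(data$R[!large] > r_cut),
    large_cells = mean(data$R[large] > r_cut))
}

# Hypothetical values, not taken from the CSD
toy <- data.frame(V = c(800, 1500, 3200, 9000, 26000, 31000, 40000),
                  R = c(3.5, 4.2, 11.0, 6.8, 12.5, 9.4, 15.1))
share_high_r(toy)
```

With only a handful of very large cells in a real data set, such shares should be read as a rough indication rather than a firm statistic.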

4.5 R factor vs No. of Coordinates

With a small number of coordinates it is possible to reach very low R factor values. With a larger number of parameters to refine it is harder to obtain good R values. This may be caused by distortions in the crystal (e.g. disorder) that need to be refined.

Structures with more than 600 coordinates are listed below. More coordinates are usually needed for refinement in low-symmetry space groups, with large cell volumes and more unique chemical units in the unit cell.

ggplot()+
  geom_point(data = structures2, aes(y = R, x = No.of.Coordinates),
             color = "black", 
             fill = "blue", 
             shape = 21 , 
             alpha = 0.2, 
             size = 2) + 
  geom_label_repel(data =  structures2[structures2$No.of.Coordinates >600,],  
                    aes(x = No.of.Coordinates, y=R, label = Refcode),
                    colour = 'black',
                    fill = "lightgoldenrod1",
                    size = 4, 
                    hjust = -0.5, 
                    vjust = 0) +
  labs(title = 'R factor vs No. of Coordinates') +
  theme(plot.title = element_text(face = "bold", size = 15),
        axis.title = element_text(face = "italic", size = 13)) +
  scale_y_continuous(breaks = seq(0, 50, by = 5))

structures %>%  filter(No.of.Coordinates >600) %>%  select(Refcode, R, Space.Group, No.of.Coordinates, T, V, Unique.Chem.Units , Pub.Year)
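
A rank correlation gives a quick, scale-free indication of whether R tends to rise with the number of coordinates. The sketch uses base R's `cor()` with `method = "spearman"` on made-up values and the column names `R` and `No.of.Coordinates`; it complements the scatter plot and does not replace it.

```r
# Hypothetical values, not taken from the CSD
toy <- data.frame(No.of.Coordinates = c(20, 45, 80, 150, 320, 610, 900),
                  R                 = c(3.2, 4.1, 3.8, 5.5, 7.9, 9.6, 12.3))

# Spearman's rho depends only on the ordering of the values, not on their scale.
rho <- cor(toy$No.of.Coordinates, toy$R, method = "spearman")
rho
```

A value close to 1 means that structures with more coordinates almost always have a higher R factor in the sample.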

5 Summary

Our code enables fast analysis of any group of structures. We created several charts for a basic analysis of data generated with the CCDC ConQuest software. Depending on the need, further charts can be tailored to the topic under study.

References:

Cambridge Structure Database (CSD) analysis with R in an example of compounds with the hydroxyl group

Bernadeta Nowosielska, Pawel Socha