Crystallography is the branch of science devoted to the study of molecular and crystalline structure and properties, with far-reaching applications in mineralogy, chemistry, physics, mathematics, biology, metallurgy and materials science.
The Cambridge Crystallographic Data Centre (CCDC) is a world-leading centre of expertise in structural chemistry data, software, and knowledge for materials and life science research and application. The CCDC compiles and distributes the Cambridge Structural Database (CSD), a certified trusted database of fully curated and enhanced organic and metal-organic structures, used by researchers across the globe.
Our purpose was to create code that allows faster analysis of large amounts of data from the CSD.
Several structure parameters can be taken from the CSD and exported to a .csv file. A basic description of the parameters used in this analysis is given below. More information is available through the external links in some of the definitions.
The total number of structures in the CSD is over one million, so we restricted the analysis to compounds with a hydroxyl (-OH) group (only 9723 structures).
Other restrictions were:
However, it is still possible to adjust our code for other specific searches!
The first step is to read in a data set generated by the CCDC ConQuest software.
structures <- read.csv2("alkohole.csv")
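`read.csv2()` expects semicolon-separated fields and a decimal comma, whereas `read.csv()` expects commas and a decimal point. Which of the two fits depends on how the file was exported; that the export used here follows the first convention is inferred only from the call above. The small sample below uses invented values to show how such text is parsed:

```r
# Invented two-row sample in the semicolon / decimal-comma convention.
sample_text <- "Refcode;R.factor..R.\nTOY01;4,52\nTOY02;10,10"
parsed <- read.csv2(text = sample_text)
str(parsed)  # R.factor..R. is read as numeric, not as text
```

If the numeric columns of a real export come back as character, the file most likely uses the other convention and `read.csv()` is the better fit.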
Check for, install and load the required packages.
requiredPackages = c("ggplot2","ggrepel", "dplyr","tidyr", "plotly" ) #list of required packages
for(i in requiredPackages) {if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages) {library(i,character.only = TRUE)}
The CCDC ConQuest software does not export the crystal family directly, so we derived a new parameter named “Crystal Family” from the space group number of each structure.
structures<- structures %>%
mutate(Crystal.Family = case_when(between(Space.Gp..Number..I., 1 ,2) ~ "triclinic",
between(Space.Gp..Number..I., 3 ,15) ~ "monoclinic",
between(Space.Gp..Number..I., 16 ,74) ~ "orthorhombic",
between(Space.Gp..Number..I., 75 ,142) ~ "tetragonal",
between(Space.Gp..Number..I., 143 ,194) ~ "hexagonal",
between(Space.Gp..Number..I., 195 ,230) ~ "cubic"))
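The same mapping can also be written as a small lookup helper, which makes the range boundaries easy to test in isolation. This is only a sketch in base R, not the code used for the analysis; the function name `family_from_sg` is hypothetical.

```r
# Hypothetical helper: map a space group number (1-230) to a crystal family.
family_from_sg <- function(sg) {
  # Lower bound of each range; trigonal and hexagonal groups (143-194)
  # together form the hexagonal crystal family.
  lower <- c(1, 3, 16, 75, 143, 195)
  families <- c("triclinic", "monoclinic", "orthorhombic",
                "tetragonal", "hexagonal", "cubic")
  idx <- findInterval(sg, lower)
  # Anything outside 1-230 (or missing) is not a valid space group number.
  idx[is.na(sg) | sg < 1 | sg > 230] <- NA
  families[idx]
}

family_from_sg(c(2, 14, 62, 142, 167, 230))
```

Unlike the `case_when()` version, this helper returns `NA` explicitly for out-of-range numbers, which can help to spot malformed entries.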
In our opinion, the default column names were inconvenient for further visualisation, so we replaced them with clearer ones.
structures <- rename(structures, R = R.factor..R. ,
Space.Group = Space.Gp..Symbol..L.,
Space.Group.Number = Space.Gp..Number..I.,
No.of.Coordinates = No..of.Coordinates..I.,
Z = Z.Value..R.,
Z.prime = Z.Prime..R.,
T = Study.Temp...I.,
Dens = Calc..Density..R.,
a = a..R.,
b = b..R.,
c = c..R.,
Alpha = Alpha..R.,
Beta = Beta..R.,
Gamma = Gamma..R.,
V = Cell.Volume..R.,
Pub.Year = Publication.Year..I.,
Unique.Chem.Units = Unique.Chemical.Units..I.,
Name = Compound.Name..L.)
Large data sets like this one can contain errors. In our case, we checked whether every structure in the set has a well-determined space group, or whether the structure could have been solved in higher symmetry.
Each crystal family places strict constraints on the unit cell parameters. We checked whether the unit cell parameters from the CSD match the corresponding crystal family.
We found ten structures whose unit cell parameters are also consistent with a higher-symmetry space group. This may be coincidental or a mistake, so the structures listed below deserve closer analysis.
# Flag cells whose metric would also fit a higher-symmetry crystal family.
# Angles are compared with 90 degrees; cell lengths are compared with each other.
structures <- structures %>%
mutate(Compatibility = case_when(Crystal.Family=="triclinic" &
((round(Alpha,2)==90 & round(Beta,2)==90) |
(round(Alpha,2)==90 & round(Gamma,2)==90) |
(round(Gamma,2)==90 & round(Beta,2)==90))~ "NO",
Crystal.Family=="monoclinic" &
(round(Alpha,2)==90 & round(Beta,2)==90 & round(Gamma,2)==90) ~ "NO",
Crystal.Family=="orthorhombic" &
((round(a,2)==round(b,2)) |
(round(a,2)==round(c,2)) |
(round(b,2)==round(c,2)))~ "NO",
Crystal.Family=="tetragonal" &
(round(a,2)==round(b,2) & round(b,2)==round(c,2)) ~ "NO"))
structures %>% filter(Compatibility=="NO") %>% select(Refcode, R, Space.Group, a, b, c, Alpha, Beta, Gamma, Pub.Year)
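The idea behind such a metric check is that two right angles in a triclinic cell, or two equal edges in an orthorhombic cell, hint at possible higher symmetry. The base R sketch below applies it to three made-up cells; the refcodes, values and the helper `same()` are hypothetical and are not taken from the CSD.

```r
# Toy cells (invented values, not real CSD entries).
cells <- data.frame(
  Refcode = c("TOY01", "TOY02", "TOY03"),
  Crystal.Family = c("triclinic", "orthorhombic", "orthorhombic"),
  a = c(7.10, 9.50, 8.00),
  b = c(8.20, 9.50, 11.30),
  c = c(9.30, 12.00, 14.60),
  Alpha = c(90, 90, 90),
  Beta = c(101.5, 90, 90),
  Gamma = c(90, 90, 90)
)

# TRUE where two values agree to two decimal places.
same <- function(x, y) round(x, 2) == round(y, 2)

# Number of right angles and number of equal edge pairs per cell.
n_right <- with(cells, same(Alpha, 90) + same(Beta, 90) + same(Gamma, 90))
n_equal <- with(cells, same(a, b) + same(a, c) + same(b, c))

cells$Suspect <- with(cells,
  (Crystal.Family == "triclinic" & n_right >= 2) |
  (Crystal.Family == "orthorhombic" & n_equal >= 1))

cells[cells$Suspect, "Refcode"]
```

A flagged cell is only a candidate for re-examination: matching cell parameters are necessary for higher symmetry, but they do not prove it.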
This bar chart shows the number of structures in the CSD published each year. After 2005, noticeably more structures with an -OH group were published. This is probably due to wider access to equipment and/or a shift in research towards compounds with a hydroxyl group. The number of structures from 2019 is still increasing because research is ongoing.
structures3 <- structures %>%
group_by(Pub.Year) %>%
summarise(number = n())
plot_ly(
x = structures3$Pub.Year,
y = structures3$number,
name = "Number of structures by year",
type = "bar",
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))) %>%
layout(title = "Number of structures by year",
yaxis = list(title = 'Count'),
xaxis = list(title = 'Year'))
The bar chart shows that compounds with an -OH group most often crystallise in the monoclinic crystal system. This matches our expectations, because organic compounds usually crystallise in low-symmetry space groups. For the same reason, no structures from the cubic system appear in this chart.
structures4 <- structures %>%
group_by(Crystal.Family) %>%
summarise(number = n())
target <- c('triclinic', 'monoclinic', 'orthorhombic', 'tetragonal', 'hexagonal', 'cubic')
structures5 <- structures4[match(target, structures4$Crystal.Family),]
plot_ly(data = structures5,
x = ~Crystal.Family,
y = ~number,
name = "Crystal.Family",
type = "bar",
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))) %>%
layout(title = "Number of structure by crystal family",
yaxis = list(title = 'Count'),
xaxis = list(title = 'Crystal Family',
categoryorder = "array",
categoryarray = ~Crystal.Family))
This chart shows that until 1965 the R factor values were above 10%, because equipment and refinement software were not as advanced as they are today. After 1965, the number of structures with an R factor of around 5% slowly increases. After 2000, some structures have very low R factors (around 3% or less). Values below 7% are normally expected; the IUCr recommends that higher values be accompanied by a suitable explanation in the publication.
Structures with an R factor above 25%, which may be of interest for deeper crystallographic analysis, are listed below the chart.
structures2 <- structures[structures$R!=0,]
ggplot()+
geom_point(data = structures2, aes(y = R, x = Pub.Year),
color = "black",
fill = "blue",
shape = 21 ,
alpha = 0.2,
size = 2) +
geom_label_repel(data = structures2[structures2$R >25.0,],
aes(x = Pub.Year, y=R, label = Refcode),
colour = 'black',
fill = "lightgoldenrod1",
size = 4,
hjust = 0,
vjust = 0) +
labs(title = 'R factor vs Publication Year') +
theme(plot.title = element_text(face = "bold", size = 15),
axis.title = element_text(face = "italic", size = 13)) +
scale_x_continuous(breaks = seq(1900, 2020, by = 10)) +
scale_y_continuous(breaks = seq(0, 50, by = 5))
structures %>% filter(R > 25) %>% select(Refcode, R, Space.Group, No.of.Coordinates, T, Unique.Chem.Units , Pub.Year)
For small unit cell volumes there is no visible relation between volume and R factor, but as the volume increases, the R factor rises slightly. For very large unit cells (more than 25000 cubic angstroms), the R factor of most structures is higher than 10%. This shows how hard it is to refine such structures to a good R.
Structures with V above 25000 cubic angstroms are listed below. These compounds usually have a high number of coordinates and more unique chemical units. Several temperatures are listed as 0 K; this is an obvious mistake that should be corrected by the authors of the publications.
ggplot()+
geom_point(data = structures2, aes(y = R, x = V),
color = "black",
fill = "blue",
shape = 21 ,
alpha = 0.2,
size = 2) +
geom_label_repel(data = structures2[structures2$V >25000.0,],
aes(x = V, y=R, label = Refcode),
colour = 'black',
fill = "lightgoldenrod1",
size = 4,
hjust = 0,
vjust = 0) +
labs(title = 'R factor vs Volume') +
theme(plot.title = element_text(face = "bold", size = 15),
axis.title = element_text(face = "italic", size = 13)) +
scale_y_continuous(breaks = seq(0, 50, by = 5))
structures %>% filter(V >25000.0) %>% select(Refcode, R, Space.Group, No.of.Coordinates, T, V, Unique.Chem.Units , Pub.Year)
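The volumes used here come straight from the CSD export. As an optional consistency check, they can be recomputed from the cell lengths and angles with the general (triclinic) cell volume formula. The function name `cell_volume` is hypothetical, and this sketch was not part of the original analysis.

```r
# Unit cell volume from cell lengths (angstroms) and angles (degrees),
# using the general formula that is valid for every crystal family.
cell_volume <- function(a, b, c, alpha, beta, gamma) {
  ca <- cos(alpha * pi / 180)
  cb <- cos(beta * pi / 180)
  cg <- cos(gamma * pi / 180)
  a * b * c * sqrt(1 - ca^2 - cb^2 - cg^2 + 2 * ca * cb * cg)
}

# A 10 x 10 x 10 cell with right angles should give about 1000 cubic angstroms.
cell_volume(10, 10, 10, 90, 90, 90)
```

After the renaming step, the same function could be applied to the whole table with `with(structures, cell_volume(a, b, c, Alpha, Beta, Gamma))` and compared with `V`; large differences would point to entries worth a second look.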
With a small number of coordinates it is possible to reach very low values of the R factor. With more parameters to refine, it is harder to obtain good values of R. This may be caused by distortions in the crystal (e.g. disorder) that need to be refined.
Structures with more than 600 coordinates are listed below. More coordinates are usually needed for refinements in low-symmetry space groups, with large volumes and with more unique chemical units in the unit cell.
ggplot()+
geom_point(data = structures2, aes(y = R, x = No.of.Coordinates),
color = "black",
fill = "blue",
shape = 21 ,
alpha = 0.2,
size = 2) +
geom_label_repel(data = structures2[structures2$No.of.Coordinates >600,],
aes(x = No.of.Coordinates, y=R, label = Refcode),
colour = 'black',
fill = "lightgoldenrod1",
size = 4,
hjust = -0.5,
vjust = 0) +
labs(title = 'R factor vs No. of Coordinates') +
theme(plot.title = element_text(face = "bold", size = 15),
axis.title = element_text(face = "italic", size = 13)) +
scale_y_continuous(breaks = seq(0, 50, by = 5))
structures %>% filter(No.of.Coordinates >600) %>% select(Refcode, R, Space.Group, No.of.Coordinates, T, V, Unique.Chem.Units , Pub.Year)
Our code allows fast analysis of any group of structures. We created several charts for a basic analysis of data generated with the CCDC ConQuest software. Depending on the need, the charts can be tailored further to the topic being studied.
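One way to adapt the charts to other parameters is to wrap the repeated scatter-plot code in a function. The sketch below is a suggestion only: the name `plot_r_vs` and its arguments are hypothetical, and it assumes a data frame with `R` and `Refcode` columns, as produced by the renaming step.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical helper: scatter plot of R factor against one column,
# labelling the points whose x value exceeds `threshold`.
plot_r_vs <- function(data, x, threshold, title) {
  # Rows to label; missing x values are never flagged.
  flagged <- data[!is.na(data[[x]]) & data[[x]] > threshold, ]
  ggplot(data, aes(x = .data[[x]], y = .data[["R"]])) +
    geom_point(color = "black", fill = "blue", shape = 21,
               alpha = 0.2, size = 2) +
    geom_label_repel(data = flagged, aes(label = Refcode),
                     colour = "black", fill = "lightgoldenrod1", size = 4) +
    labs(title = title, x = x, y = "R") +
    theme(plot.title = element_text(face = "bold", size = 15),
          axis.title = element_text(face = "italic", size = 13))
}
```

For example, `plot_r_vs(structures2, "V", 25000, "R factor vs Volume")` would produce a chart similar to the volume chart. The chart of R factor against publication year labels points by R rather than by the x value, so it would need a small variation of this helper.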
References: