import pandas as pd
import os

def read_powcod_sql():
    import sqlite3
    con = sqlite3.connect('cod2205.sq')
    df = pd.read_sql_query('SELECT * FROM id', con)
    con2 = sqlite3.connect('cod2205.sq.info')
    df2 = pd.read_sql_query('SELECT * FROM info', con2)
    df.id = pd.to_numeric(df.id)
    df2.id = pd.to_numeric(df2.id)
    combined = df.merge(df2, left_on="id", right_on="id")
    return combined

if os.path.isfile("combined.pkl"):
    combined = pd.read_pickle("combined.pkl")
else:
    combined = read_powcod_sql()
    print(combined.columns)
    combined = combined[["spacegroup_x", "a", "b", "c", "alpha", "beta",
                         "gamma", "volume", "h", "k", "l", "nd", "dvalue",
                         "intensita", "type", "id"]]
    combined.to_pickle("combined.pkl")
Introduction
Neural networks have become a powerful tool for automating complex tasks in crystallography, including PXRD data analysis: for example, indexing, crystallite size and scale factor determination, and crystal system and space group classification. There are many more examples in the literature! In future posts, we’ll look at how to train our own neural networks for analysing PXRD data. However, before we get there, we’ll need a large volume of PXRD data on which they can be trained.
To the best of my knowledge, there is no extant database of experimental PXRD patterns. However, there are several databases, such as the COD, the CCDC’s CSD and the ICDD’s PDF, which provide experimental crystal structures. These databases, which contain hundreds of thousands of crystal structures, can then be used to calculate diffraction patterns. The CSD has a Python API which allows PXRD data to be generated for each of the entries in the database. However, this is not optimised to run quickly, and as such, the PXRD data would need to be generated in advance.
In this post, we’ll look at how we can efficiently calculate large numbers of diffraction patterns on-the-fly whilst training neural networks. By writing our own code to generate synthetic PXRD data, we will also be able to implement various data augmentation transformations. Such transformations make the calculated PXRD data look different from the perspective of a machine learning algorithm, but could still plausibly be produced by the same underlying material. This will help to make our future neural networks more robust and able to be applied to real-world diffraction data.
Synthetic Diffraction data using the PowCod database
I used the PowCod database produced by the Bari group as the source for the diffraction data. This database is based on the COD and, for each entry, a powder X-ray diffraction pattern has been calculated. This is saved in PowCod in the form of the Miller indices of each reflection, the associated intensity, the d-spacing and other relevant information (e.g. multiplicity). Being able to read in this information directly, rather than having to calculate it for each of the crystal structures, makes this work significantly easier.
As described earlier, PowCod provides lots of information. Most pertinent for this work are the unit cells, space groups, Miller indices and associated intensities for all of the crystal structures. In the next sections, we’ll look at how I filtered the PowCod database, and then discuss the PyTorch-based code I’ve written to generate relatively realistic looking PXRD data on-the-fly using my GPU.
I’ve included several methods of data augmentation:
- Modification of unit cells to simulate thermal expansion / contraction
- Modification of intensities using the March-Dollase method to simulate preferred orientation effects
- Modification of peak shapes, broadening and asymmetry to simulate different crystallite sizes/strains and instrumentation
- Modification of zero point errors to simulate different instruments
- Modification of noise to simulate different experimental data collection times / sample sizes
Filtering PowCod data
The PowCod database, which can be obtained here, is made up of several .sql files which need to be parsed to extract the desired information. In our case, what we are after is:
- Unit cells
- Crystal systems
- Space groups
- Miller indices
- Intensities
We’ll probably also need to use some additional information to help us along the way, for example, the d-spacing of the reflections.
As the files total only around 5.5 GB, I used pandas to read the database into memory and extract the information needed. I pickled the resultant dataframe to enable it to be more easily read in the future.
Let’s have a look at what we’ve got:
Show the code
combined
 | spacegroup_x | a | b | c | alpha | beta | gamma | volume | h | k | l | nd | dvalue | intensita | type | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C m c m | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.000 | 0.0 | 738.078 | 0 | 0 | 0 | 1 | 0.000000 | 0.000000 | Orthorhombic | 1010866 |
1 | I 4/m m m | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.000 | 0.0 | 388.473 | 0 | 0 | 0 | 1 | 0.000000 | 0.000000 | Tetragonal | 6000382 |
2 | P -1 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.000 | 0.0 | 429.928 | 0 | 0 | 0 | 1 | 0.000000 | 0.000000 | Triclinic | 6000073 |
3 | P -3 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.000 | 0.0 | 1221.297 | 0 | 0 | 0 | 1 | 0.000000 | 0.000000 | Trigonal (hexagonal axes) | 6000152 |
4 | P 1 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.000 | 0.0 | 552.983 | 0 | 0 | 0 | 1 | 0.000000 | 0.000000 | Cubic | 6000430 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
458155 | P 1 21 1 | 6.6330 | 15.4171 | 11.9417 | 90.0 | 103.335 | 90.0 | 1188.257 | 0,0,0,1,0,1,1,1,0,0,1,1,1,1,1,0,1,0,1,1,1,1,1,... | 0,1,2,0,2,0,1,1,0,1,0,2,0,1,2,3,1,2,2,2,3,3,0,... | 1,1,0,0,1,-1,0,-1,2,2,1,0,-2,1,-1,1,-2,2,1,-2,... | 499 | 11.619700,9.279300,7.708500,6.454200,6.423600,... | 1000.000000,62.638200,82.940500,3.929300,0.067... | Monoclinic | 4344045 |
458156 | C 1 2/m 1 | 21.2287 | 17.8117 | 12.3055 | 90.0 | 124.759 | 90.0 | 3822.662 | 1,1,0,2,0,2,0,2,1,3,2,2,1,1,3,3,1,2,4,2,0,4,1,... | 1,1,0,0,2,0,2,2,1,1,2,0,3,1,1,1,3,0,0,2,0,0,3,... | 0,-1,1,-1,0,0,1,-1,1,-1,0,-2,0,-2,0,-2,-1,1,-1... | 498 | 12.461500,10.123800,10.109700,9.999700,8.90580... | 1000.000000,1.459300,384.189200,186.432500,88.... | Monoclinic | 4123491 |
458157 | C 1 2/m 1 | 21.5314 | 18.0366 | 12.3918 | 90.0 | 124.872 | 90.0 | 3948.216 | 1,1,0,2,0,2,0,2,1,3,2,2,1,3,1,3,1,2,4,2,0,4,1,... | 1,1,0,0,2,0,2,2,1,1,2,0,3,1,1,1,3,0,0,2,0,0,3,... | 0,-1,1,-1,0,0,1,-1,1,-1,0,-2,0,0,-2,-2,-1,1,-1... | 498 | 12.620300,10.213500,10.166600,10.122700,9.0183... | 1000.000000,0.425500,401.199700,131.443500,109... | Monoclinic | 4123492 |
458158 | I -4 2 m | 10.2400 | 10.2400 | 9.6520 | 90.0 | 90.000 | 90.0 | 1012.086 | 1,1,2,0,2,1,2,2,3,3,1,2,3,3,2,4,3,0,4,3,4,1,4,... | 1,0,0,0,1,1,2,0,1,0,0,2,2,1,1,0,3,0,1,0,2,1,0,... | 0,1,0,2,1,2,0,2,0,1,3,2,1,2,3,0,0,4,1,3,0,4,2,... | 302 | 7.240800,7.023700,5.120000,4.826000,4.137400,4... | 435.793700,1000.000000,156.773800,74.174400,1.... | Tetragonal | 1508227 |
458159 | P n a 21 | 10.3130 | 8.1940 | 4.9930 | 90.0 | 90.000 | 90.0 | 421.932 | 1,2,2,0,0,1,1,2,2,2,3,1,2,3,1,3,4,0,4,2,0,1,3,... | 1,0,1,1,2,1,2,0,1,2,1,2,2,1,3,2,0,0,1,3,3,3,2,... | 0,0,0,1,0,1,0,1,1,0,0,1,1,1,0,0,0,2,0,0,1,1,1,... | 484 | 6.415500,5.156500,4.364200,4.263800,4.097000,3... | 884.735400,207.638600,175.487800,594.721200,42... | Orthorhombic | 8103382 |
458160 rows × 16 columns
Show the code
combined.describe()
 | a | b | c | alpha | beta | gamma | volume | nd | id |
---|---|---|---|---|---|---|---|---|---|
count | 458160.000000 | 458160.000000 | 458160.000000 | 458160.000000 | 458160.000000 | 458160.000000 | 458160.000000 | 458160.000000 | 4.581600e+05 |
mean | 12.505100 | 13.372528 | 16.330272 | 89.503372 | 95.199651 | 90.906179 | 3178.613746 | 454.593930 | 4.604041e+06 |
std | 6.874009 | 6.655914 | 8.092967 | 8.326557 | 11.649541 | 10.924374 | 6384.755145 | 119.026662 | 2.281053e+06 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000e+06 |
25% | 8.282900 | 9.153000 | 11.279975 | 90.000000 | 90.000000 | 90.000000 | 1089.974000 | 498.000000 | 2.221300e+06 |
50% | 10.780000 | 12.130000 | 15.169000 | 90.000000 | 92.430000 | 90.000000 | 1976.458000 | 499.000000 | 4.302406e+06 |
75% | 14.703000 | 16.140000 | 19.860000 | 90.000000 | 101.688000 | 90.000000 | 3562.973750 | 500.000000 | 7.046806e+06 |
max | 189.800000 | 150.000000 | 475.977800 | 150.172000 | 173.895000 | 149.900000 | 740478.188000 | 500.000000 | 9.016730e+06 |
Show the code
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 4, figsize=(18,7))
sns.kdeplot(combined.a, ax=ax[0][0])
sns.kdeplot(combined.b, ax=ax[0][1])
sns.kdeplot(combined.c, ax=ax[0][2])
sns.kdeplot(combined.volume, ax=ax[0][3])
sns.kdeplot(combined.alpha, ax=ax[1][0])
sns.kdeplot(combined.beta, ax=ax[1][1])
sns.kdeplot(combined.gamma, ax=ax[1][2])
sns.kdeplot(combined.nd, ax=ax[1][3])
for i in range(2):
    for j in range(3):
        ax[i][j+1].set_ylabel("")
plt.show()
We’re starting off with 458k samples, but a lot of these won’t be useful for us. I’m choosing to focus on materials with relatively small unit cells, and attempting to generate realistic looking laboratory diffraction data.
We can see that some unit cells are listed as having zero volume, and have a value of 0 for the unit cell lengths and angles. Our goal of modifying unit cells and using them to generate the diffraction data as a form of data augmentation means that these entries are not useful to us, so we’ll need to remove them. We can also see that there are some monstrous unit cells in there, with volumes of >700,000 ų!
To make things manageable for future machine learning projects, we’re going to set limits on the allowable unit cell dimensions and volumes. Let’s set the lower and upper limits for volume to 400 and 4000 ų respectively, and set the maximum unit cell length to 50 Å. We can also see some unusual looking unit cell angles, so I’m going to trim entries with cell angles below 60 and above 120 degrees (keeping 120 degrees is important to ensure we keep hexagonal unit cells).
Eventually, we’re going to generate diffraction data. I’ve chosen to focus on realistic looking laboratory diffraction data, and hence I’m choosing to work in the (arbitrarily chosen!) range 4 - 44 ° \(2\theta\) with Cu K-α1 radiation. We’ll want to ensure that there are at least a few easily detectable peaks within this chosen range. I’ll set the lower bound for this as 10.
min_volume = 400
max_volume = 4000
max_cell_length = 50
min_cell_angle = 60
max_cell_angle = 120
min_data_angle = 4
max_data_angle = 44
wavelength = 1.54056
min_detectable_peaks_in_range = 10
Unit cell dimensions and volume
Let’s first filter by the unit cell restrictions on volume, lengths and angles. We can also filter by the number of d-spacings (in the column nd), though we will need to come back to this again later to check that the remaining reflections are both within the desired range and have reasonable intensities.
Show the code
combined = combined[(combined.volume <= max_volume) & (combined.volume >= min_volume)]
combined = combined[(combined.a != 0) & (combined.b != 0) & (combined.c != 0)]
combined = combined[(combined.a <= max_cell_length) & (combined.b <= max_cell_length) & (combined.c <= max_cell_length)]
combined = combined[(combined.alpha >= min_cell_angle) & (combined.alpha <= max_cell_angle)]
combined = combined[(combined.beta >= min_cell_angle) & (combined.beta <= max_cell_angle)]
combined = combined[(combined.gamma >= min_cell_angle) & (combined.gamma <= max_cell_angle)]
combined = combined[(combined.nd >= min_detectable_peaks_in_range)]
combined.describe()
 | a | b | c | alpha | beta | gamma | volume | nd | id |
---|---|---|---|---|---|---|---|---|---|
count | 324333.000000 | 324333.000000 | 324333.000000 | 324333.000000 | 324333.000000 | 324333.000000 | 324333.000000 | 324333.000000 | 3.243330e+05 |
mean | 10.992746 | 12.400612 | 15.034673 | 89.719327 | 94.956107 | 90.499324 | 1849.638022 | 477.448866 | 4.549946e+06 |
std | 4.372923 | 4.840500 | 5.626536 | 7.474114 | 9.913211 | 9.476333 | 931.743127 | 81.543710 | 2.265363e+06 |
min | 2.808500 | 2.820000 | 2.718100 | 60.019000 | 60.130000 | 60.000000 | 400.059000 | 16.000000 | 1.000000e+06 |
25% | 8.236400 | 9.237500 | 11.247500 | 90.000000 | 90.000000 | 90.000000 | 1085.857000 | 498.000000 | 2.219163e+06 |
50% | 10.114500 | 11.585500 | 14.365300 | 90.000000 | 93.216000 | 90.000000 | 1699.288000 | 499.000000 | 4.129720e+06 |
75% | 12.609000 | 14.506900 | 17.908600 | 90.000000 | 101.601000 | 90.000000 | 2515.215000 | 500.000000 | 7.047435e+06 |
max | 49.953000 | 49.977000 | 49.944000 | 119.990000 | 120.000000 | 120.000000 | 3999.991000 | 500.000000 | 9.016729e+06 |
That’s a fairly significant reduction in the number of samples - the size of the dataset has gone down by over 100k! However, hopefully we’ve now gotten rid of most of the really odd or unusable samples in the dataset, and the remaining filtering won’t have such a drastic effect on the remaining entries.
Let’s now replot the kernel density plots we saw earlier and see how things look after the filtering.
Show the code
fig, ax = plt.subplots(2, 4, figsize=(18,7))
sns.kdeplot(combined.a, ax=ax[0][0])
sns.kdeplot(combined.b, ax=ax[0][1])
sns.kdeplot(combined.c, ax=ax[0][2])
sns.kdeplot(combined.volume, ax=ax[0][3])
sns.kdeplot(combined.alpha, ax=ax[1][0])
sns.kdeplot(combined.beta, ax=ax[1][1])
sns.kdeplot(combined.gamma, ax=ax[1][2])
sns.kdeplot(combined.nd, ax=ax[1][3])
plt.show()
Things look a bit nicer now! Just out of interest, we’ll take a quick look at the distribution of crystal systems and space groups. Other work looking at the CSD and other databases has shown a significant imbalance in the number of crystal structures adopting each crystal system and indeed space group.
Show the code
combined.type.value_counts().sort_values().iloc[::-1].head(20).plot.bar()
plt.show()
Show the code
combined.spacegroup_x.value_counts().sort_values().iloc[::-1].head(20).plot.bar()
plt.show()
Unsurprisingly, we see that some space groups are much more highly represented than others. If we decide to try crystal system, space group or extinction symbol classification using this data, we will need to keep this fairly severe class imbalance in mind!
For convenience for future work, it’d be good to convert these labels into their international tables numbers. Let’s write a quick function to do that, getting rid of any entries that have unknown space groups, or space group labels that are otherwise difficult to interpret.
Show the code
from pymatgen.symmetry import groups
import numpy as np

symbol_to_num = {}
for i, s in enumerate(np.unique(combined.spacegroup_x.to_numpy())):
    try:
        g = groups.SpaceGroup(s)
        symbol_to_num[s] = g.int_number
    except:
        symbol_to_num[s] = -1

def label_SGs(sg):
    return symbol_to_num[sg]

combined["sg_number"] = combined.spacegroup_x.apply(label_SGs)
combined = combined[combined.sg_number > 0]
combined.sg_number.describe()
count 323694.000000
mean 24.984507
std 44.618451
min 1.000000
25% 2.000000
50% 14.000000
75% 15.000000
max 230.000000
Name: sg_number, dtype: float64
We lost <1000 entries in that step, which is good news!
Diffraction data
We can now turn our attention to the diffraction data - the intensities, Miller indices and d-spacings. Some of the entries will have a huge number of peaks, some will have very few. To start with, I’m choosing to exclude samples that have reflections below the bottom of our data range. I also want to ensure that there is still at least one peak above the top end of our data range which ensures that we aren’t missing any data inside our chosen data range.
Our dataframe contains the d-spacings for the observed reflections, so using Bragg’s law we can easily convert the angular range bounds into d-spacings, and then apply the filtering just described.
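Concretely, rearranging Bragg’s law \(\lambda = 2d\sin\theta\) gives \(d = \lambda/(2\sin\theta)\), so with \(\lambda = 1.54056\) Å the lower limit of \(2\theta = 4\)° corresponds to a maximum d-spacing of \(1.54056/(2\sin 2°) \approx 22.07\) Å, and the upper limit of \(2\theta = 44\)° to a minimum d-spacing of \(1.54056/(2\sin 22°) \approx 2.06\) Å, as printed below.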
Show the code
import numpy as np

d_from_tt = lambda x: wavelength/(2*np.sin(np.deg2rad(x)/2))
tt_from_d = lambda x: 2*np.rad2deg(np.arcsin(wavelength/(2*x)))
max_d = d_from_tt(min_data_angle)
min_d = d_from_tt(max_data_angle)
print("Wavelength:", wavelength)
print(f"Minimum angle = {min_data_angle} : Maximum d-spacing = {max_d:.3f}")
print(f"Maximum angle = {max_data_angle} : Minimum d-spacing = {min_d:.3f}")
Wavelength: 1.54056
Minimum angle = 4 : Maximum d-spacing = 22.071
Maximum angle = 44 : Minimum d-spacing = 2.056
Show the code
def d_spacing_limits(x, min_d_spacing=1, max_d_spacing=100):
    d = list(filter(None, x.strip().split(",")))
    d = [float(i) for i in d]
    if len(d) == 0:
        return False
    else:
        if d[-1] > min_d_spacing:
            return False
        if d[0] > max_d_spacing:
            return False
    return True

combined["d_spacing_filter"] = combined.dvalue.apply(d_spacing_limits, args=(min_d, max_d))
Show the code
combined = combined[combined["d_spacing_filter"] == True]
combined.describe()
 | a | b | c | alpha | beta | gamma | volume | nd | id | sg_number |
---|---|---|---|---|---|---|---|---|---|---|
count | 294047.000000 | 294047.000000 | 294047.000000 | 294047.000000 | 294047.000000 | 294047.000000 | 294047.000000 | 294047.000000 | 2.940470e+05 | 294047.000000 |
mean | 10.880480 | 12.250730 | 14.660952 | 89.810688 | 95.496771 | 90.697710 | 1761.181410 | 475.436522 | 4.502841e+06 | 27.194466 |
std | 4.407073 | 4.930788 | 5.572797 | 6.735834 | 9.633053 | 8.870661 | 911.606182 | 85.121247 | 2.291774e+06 | 46.176416 |
min | 2.808500 | 2.825500 | 2.718100 | 60.019000 | 60.130000 | 60.000000 | 400.059000 | 16.000000 | 1.000000e+06 | 1.000000 |
25% | 8.104550 | 9.061050 | 10.951000 | 90.000000 | 90.000000 | 90.000000 | 1033.725000 | 498.000000 | 2.214880e+06 | 4.000000 |
50% | 9.900700 | 11.252000 | 13.876500 | 90.000000 | 93.612000 | 90.000000 | 1587.395000 | 499.000000 | 4.122961e+06 | 14.000000 |
75% | 12.467950 | 14.239950 | 17.401100 | 90.000000 | 101.996000 | 90.000000 | 2354.773000 | 500.000000 | 7.047926e+06 | 19.000000 |
max | 45.366000 | 48.809000 | 49.494500 | 119.990000 | 120.000000 | 120.000000 | 3999.991000 | 500.000000 | 9.016729e+06 | 230.000000 |
We are down to just shy of 300k samples now, with one last bit of filtering to apply. Lastly, we’ll ensure that we have sufficient easily detectable peaks within our data range. To do this, we need to look at the intensities:
Show the code
combined.intensita
11 1000.000000,293.781700,0.932900,880.099700,350...
12 344.517500,136.663800,113.543400,41.743900,4.9...
13 480.841100,1000.000000,345.837700,13.887000,72...
14 6.550100,7.094000,99.801200,1000.000000,642.20...
17 2.707700,223.647400,1000.000000,119.345700,113...
...
458148 191.292600,19.724900,41.216500,169.019100,150....
458151 1000.000000,782.379000,202.061600,29.027900,13...
458155 1000.000000,62.638200,82.940500,3.929300,0.067...
458158 435.793700,1000.000000,156.773800,74.174400,1....
458159 884.735400,207.638600,175.487800,594.721200,42...
Name: intensita, Length: 294047, dtype: object
Intensities are stored as a string of comma separated values, with each pattern scaled to a maximum intensity of 1000. Next we should ensure that, in our chosen data range, the intensities that are present are actually strong enough to be observed. Let’s say we want the minimum intensity in our data range to be 0.5 % of the maximum intensity. As such, there should be at least \(N\) intensities in our range with a value greater than 5. As discussed earlier, I’ve chosen to set \(N = 10\).
Show the code
def n_peaks_in_range(dspacing, intensity, min_intensity=5):
    dspacing = list(filter(None, dspacing.strip().split(",")))
    dspacing = np.array([float(i) for i in dspacing])
    intensity = list(filter(None, intensity.strip().split(",")))
    intensity = np.array([float(i) for i in intensity])
    intensity = intensity[dspacing >= min_d]
    return (intensity > min_intensity).sum()

combined["n_detectable_peaks_in_range"] = combined.apply(lambda x: n_peaks_in_range(x.dvalue, x.intensita), axis=1)
combined["n_detectable_peaks_in_range"].describe()
count 294047.000000
mean 113.481430
std 59.305561
min 0.000000
25% 72.000000
50% 107.000000
75% 149.000000
max 386.000000
Name: n_detectable_peaks_in_range, dtype: float64
As we can see, some of the patterns we might generate would have very few or no easily detectable peaks in our chosen data range. As our final filtering step, we’ll get rid of them, leaving us with our final filtered dataset.
Show the code
combined = combined[combined.n_detectable_peaks_in_range >= min_detectable_peaks_in_range]
combined.describe()
 | a | b | c | alpha | beta | gamma | volume | nd | id | sg_number | n_detectable_peaks_in_range |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 288123.000000 | 288123.000000 | 288123.000000 | 288123.000000 | 288123.000000 | 288123.000000 | 288123.000000 | 288123.000000 | 2.881230e+05 | 288123.000000 | 288123.000000 |
mean | 10.915994 | 12.314144 | 14.769507 | 89.806797 | 95.609645 | 90.698470 | 1780.427861 | 484.062508 | 4.493060e+06 | 23.272384 | 115.697018 |
std | 4.439603 | 4.955456 | 5.568620 | 6.804673 | 9.698929 | 8.939179 | 909.067811 | 60.494039 | 2.259233e+06 | 37.387276 | 57.842270 |
min | 2.808500 | 2.825500 | 2.760000 | 60.019000 | 60.130000 | 60.000000 | 400.059000 | 30.000000 | 1.000000e+06 | 1.000000 | 10.000000 |
25% | 8.099250 | 9.133100 | 11.122850 | 90.000000 | 90.000000 | 90.000000 | 1054.453000 | 498.000000 | 2.216278e+06 | 4.000000 | 75.000000 |
50% | 9.925000 | 11.332000 | 13.984900 | 90.000000 | 93.924000 | 90.000000 | 1608.534000 | 499.000000 | 4.122969e+06 | 14.000000 | 108.000000 |
75% | 12.542100 | 14.337000 | 17.492400 | 90.000000 | 102.190000 | 90.000000 | 2378.520500 | 500.000000 | 7.045938e+06 | 15.000000 | 151.000000 |
max | 45.366000 | 48.809000 | 49.494500 | 119.990000 | 120.000000 | 120.000000 | 3999.991000 | 500.000000 | 9.016729e+06 | 230.000000 | 386.000000 |
Saving the filtered data
Now that we’ve filtered our data to our requirements, the last thing to do is save the relevant information in numpy arrays which can then be quickly and easily loaded from disk when needed.
The arrays to be saved will contain:
- Unit cells
- Crystal systems
- Miller indices
- Peak intensities
- Space groups
- COD IDs in case we want to look up a particular crystal structure
As the number of peaks in each pattern is different, the Miller index and peak intensity arrays will need padding up to the maximum number of reflections, which is 500.
Show the code
def get_array(x, dtype=np.float32):
    data = np.zeros(500, dtype=dtype)
    x = list(filter(None, x.strip().split(",")))
    x = np.array(x, dtype=dtype)
    data[:len(x)] = x
    return data

h = combined.h.apply(get_array, args=(np.int64,))
k = combined.k.apply(get_array, args=(np.int64,))
l = combined.l.apply(get_array, args=(np.int64,))
intensities = combined.intensita.apply(get_array, args=(np.float32,))

intensities = np.vstack(intensities)
h = np.vstack(h)
k = np.vstack(k)
l = np.vstack(l)
hkl = np.dstack([h,k,l])

unit_cell = combined[["a","b","c","alpha","beta","gamma"]].to_numpy()
crystal_systems, crystal_systems_numeric = np.unique(combined.type.to_numpy(), return_inverse=True)
space_group_number = combined.sg_number.to_numpy()
cod_id = combined.id.to_numpy()
print(crystal_systems)
Now we can save everything, as well as a configuration file to keep track of our settings, which we’ll save using the json
format.
Show the code
import json

base_name = f"{min_data_angle}-{max_data_angle}-CuKa1-data_{max_volume}_"
config = {}
config["base_name"] = base_name
config["min_volume"] = min_volume
config["max_volume"] = max_volume
config["max_cell_length"] = max_cell_length
config["min_cell_angle"] = min_cell_angle
config["max_cell_angle"] = max_cell_angle
config["min_data_angle"] = min_data_angle
config["max_data_angle"] = max_data_angle
config["wavelength"] = wavelength
config["min_detectable_peaks_in_range"] = min_detectable_peaks_in_range
config["crystal_systems"] = [x[0]+" = "+str(x[1]) for x in zip(crystal_systems, np.arange(len(crystal_systems)))]

with open("data_config.json", "w") as f:
    json.dump(config, f, indent=4)

np.save(base_name+"unit_cell.npy", unit_cell)
np.save(base_name+"crystal_systems_numeric.npy", crystal_systems_numeric)
np.save(base_name+"space_group_number.npy", space_group_number)
np.save(base_name+"hkl.npy", hkl)
np.save(base_name+"intensities.npy", intensities)
np.save(base_name+"cod_id.npy", cod_id)
Generating synthetic data
Now that we’ve filtered down the original PowCod database into the entries that meet the (arbitrary) conditions described earlier, it’s time to start thinking about how we can use the information to generate synthetic PXRD data.
I’ve split this task into:
- Peak positions
- Peak intensities
- Peak shapes
- Noise and background
Peak positions
Rather than using the d-spacings from the database directly, we are going to make use of the unit cells and Miller indices to calculate the d-spacings (and hence angles) for the reflections. This will allow us to add a small amount of noise to the unit cell dimensions, which serves as a form of data augmentation.
The d-spacing of a given reflection can be calculated using this equation:
\(\frac{1}{d^{2}} = X_{hh}h^2 + X_{kk}k^2 + X_{ll}l^2 + X_{hk}hk +X_{hl}hl + X_{kl}kl\)
where \(X_{nn}\) are elements of the reciprocal space metric tensor.
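Written out in terms of the reciprocal cell parameters, this is the familiar expression
\(\frac{1}{d^{2}} = h^2a^{*2} + k^2b^{*2} + l^2c^{*2} + 2hka^{*}b^{*}\cos\gamma^{*} + 2hla^{*}c^{*}\cos\beta^{*} + 2klb^{*}c^{*}\cos\alpha^{*}\)
i.e. \(X_{hh} = a^{*2}\), \(X_{hk} = 2a^{*}b^{*}\cos\gamma^{*}\) and so on.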
This means that we will need a function to convert the unit cell dimensions into a matrix representation of the unit cell, invert this to determine the reciprocal lattice matrix and from this obtain the reciprocal lattice metric tensor. Once we have the reciprocal lattice metric tensor, we can calculate the d-spacings quite easily by multiplying with the Miller index array(s). However, one complication is that we want to do this for several samples simultaneously, so need to consider the shapes of our arrays.
Before dealing with that, let’s write a function to simulate the effect of temperature changes by adding a small amount of Gaussian noise to the unit cell dimensions. We will have to consider the crystal system symmetry here in order to ensure that, for example for a cubic unit cell, the perturbation to lengths \(b\) and \(c\) is equal to the perturbation to \(a\). The standard deviation of the random noise is going to be set to a default value of 0.05, meaning that (at three standard deviations) nearly all perturbed unit cells should have edge lengths within +/- 0.15 Å of the database values, and angles within +/- 0.15 ° (assuming they are allowed to vary by symmetry). This seems reasonable, and should allow us to assume that the fractional coordinates within the unit cells are unaffected, so we don’t need to worry about the effect on the relative intensities.
Show the code
import torch

def get_unit_cell_perturbation(crystal_systems, dtype=torch.float32, stddev=0.05):
    """Generate a perturbation for the unit cell lengths and angles for a given
    crystal system.

    Args:
        crystal_systems (tensor): Crystal systems for the unit cells.
            Shape = (batch)
        dtype (torch.dtype, optional): Defaults to torch.float32.
        stddev (float, optional): Standard deviation for the gaussian noise.
            Defaults to 0.05.

    Returns:
        tensor: A tensor to additively adjust the unit cell lengths and angles
            Shape = (batch, 6)
    """
    batchsize = crystal_systems.shape[0]
    device = crystal_systems.device
    lengths, angles = torch.randn((2, batchsize, 3), device=device, dtype=dtype) * stddev
    cubic = crystal_systems == 0
    hexagonal = crystal_systems == 1
    monoclinic = crystal_systems == 2
    #orthorhombic = crystal_systems == 3 # Don't need to query data for this
    tetragonal = crystal_systems == 4
    triclinic = crystal_systems == 5
    trigonal_h = crystal_systems == 6
    trigonal_r = crystal_systems == 7

    # Cubic, tetragonal, rhombohedral and hexagonal - a and b must be the same
    lengths[:,1] = torch.where(cubic | hexagonal | tetragonal | trigonal_h | trigonal_r,
                               lengths[:,0], lengths[:,1])
    # Cubic and rhombohedral cells - a, b and c must be the same
    lengths[:,2] = torch.where(cubic | trigonal_r, lengths[:,0], lengths[:,2])
    # Rhombohedral and triclinic cells - could change their alpha values
    angles[:,0] = torch.where((trigonal_r | triclinic), angles[:,0], 0.)
    # Triclinic or monoclinic cells could change beta values
    angles[:,1] = torch.where((triclinic | monoclinic), angles[:,1], 0.)
    # Triclinic cells could change gamma
    angles[:,2] = torch.where(triclinic, angles[:,2], 0.)
    # Rhombohedral cells - need to ensure all angles are the same
    angles[:,1] = torch.where(trigonal_r, angles[:,0], angles[:,1])
    angles[:,2] = torch.where(trigonal_r, angles[:,0], angles[:,2])

    return torch.concat([lengths, angles], dim=-1)
Let’s check it does what it’s meant to!
What we’re expecting is that for the unit cell lengths, the perturbation should be the same for each of the edges for cubic and rhombohedral cells, and the perturbation for \(a\) and \(b\) should be the same for hexagonal and tetragonal cells.
With regards to angles, where the original unit cell angle is fixed at 90 or 120 degrees, the perturbation should be zero. So we should see a change in the beta value for monoclinic cells, a change in all three angles for triclinic unit cells, and the same change in each of the angles for rhombohedral cells.
Show the code
cs_key = {
    "Cubic" : 0,
    "Hexagonal" : 1,
    "Monoclinic" : 2,
    "Orthorhombic" : 3,
    "Tetragonal" : 4,
    "Triclinic" : 5,
    "Trigonal (hexagonal axes)" : 6,
    "Trigonal (rhombohedral axes)" : 7
}

for c in cs_key.items():
    name, number = c
    perturb = get_unit_cell_perturbation(torch.tensor([[int(number)]]))
    lengths = perturb[:,:3]
    angles = perturb[:,3:]
    print(name, "Length perturbation:", lengths)
    print(name, "Angle perturbation:", angles, "\n")
Cubic Length perturbation: tensor([[0.0120, 0.0120, 0.0120]])
Cubic Angle perturbation: tensor([[0., 0., 0.]])
Hexagonal Length perturbation: tensor([[0.0535, 0.0535, 0.0225]])
Hexagonal Angle perturbation: tensor([[0., 0., 0.]])
Monoclinic Length perturbation: tensor([[-0.0509, 0.0907, -0.0980]])
Monoclinic Angle perturbation: tensor([[0.0000, 0.0176, 0.0000]])
Orthorhombic Length perturbation: tensor([[-0.0788, -0.0403, -0.0860]])
Orthorhombic Angle perturbation: tensor([[0., 0., 0.]])
Tetragonal Length perturbation: tensor([[-0.0156, -0.0156, -0.0144]])
Tetragonal Angle perturbation: tensor([[0., 0., 0.]])
Triclinic Length perturbation: tensor([[-0.0154, -0.0387, 0.0159]])
Triclinic Angle perturbation: tensor([[ 0.0273, -0.0582, -0.0836]])
Trigonal (hexagonal axes) Length perturbation: tensor([[-0.0063, -0.0063, 0.0845]])
Trigonal (hexagonal axes) Angle perturbation: tensor([[0., 0., 0.]])
Trigonal (rhombohedral axes) Length perturbation: tensor([[-0.0002, -0.0002, -0.0002]])
Trigonal (rhombohedral axes) Angle perturbation: tensor([[-0.0607, -0.0607, -0.0607]])
Looks good to me. We now need to start writing functions to generate our lattice matrix, reciprocal lattice matrix and reciprocal lattice metric tensor. We can then use these in conjunction with the Miller indices to determine the reflection d-spacings.
Given that we are modifying unit cells, it’s probably sensible to check that the perturbed unit cells are valid. We can do that by calculating the volume of the unit cells, with invalid unit cells returning a volume of zero.
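Since the lattice matrix \(M\) is built directly from the three lattice vectors, the cell volume is just \(V = \lvert\det(M)\rvert\); a combination of cell lengths and angles that cannot form a closed cell (such as the 120°/120°/120° example tested below) collapses the vectors into a plane and gives a determinant, and hence volume, of zero.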
Show the code
def get_lattice_matrix(unit_cell_dimensions):
    """Calculate a lattice matrix from unit cell dimensions

    Args:
        unit_cell_dimensions (tensor): The unit cell dimensions. Lengths in
            angstroms, angles in degrees.
            Shape = (batch, 6)

    Returns:
        tensor: The matrix representation of the unit cells.
            Shape = (batch, 3, 3)
    """
    pi_over_180 = 0.017453292519943295
    a, b, c = unit_cell_dimensions[:,:3].T
    cosal, cosbe, cosga = torch.cos(unit_cell_dimensions[:,3:]*pi_over_180).T
    sinal, sinbe = torch.sin(unit_cell_dimensions[:,3:-1]*pi_over_180).T
    # Sometimes rounding errors cause |values| slightly > 1.
    val = torch.clamp((cosal * cosbe - cosga) / (sinal * sinbe), min=-1, max=1)

    gamma_star = torch.arccos(val)
    zeros = torch.zeros_like(a)
    v_a = torch.stack([a * sinbe, zeros, a*cosbe]).T
    v_b = torch.stack([-b * sinal * torch.cos(gamma_star),
                       b*sinal * torch.sin(gamma_star),
                       b*cosal]).T
    v_c = torch.stack([zeros, zeros, c]).T

    matrix = torch.stack([v_a, v_b, v_c], dim=2)

    # Unit cells are valid if cell volume > 0
    # The cell volume is |det(M)|, but don't need the absolute value here
    volume = torch.linalg.det(matrix)
    valid = volume != 0

    return matrix, valid
Let’s quickly check that this works as expected. I’ll give it a simple cubic unit cell, and an invalid unit cell.
Show the code
cubic_cell = torch.tensor([5.0, 5.0, 5.0, 90.0, 90.0, 90.0]).unsqueeze(0)
invalid_cell = torch.tensor([5.0, 6.0, 7.0, 120.0, 120.0, 120.0]).unsqueeze(0)
print("Cubic cell valid -", get_lattice_matrix(cubic_cell)[1].item())
print("Invalid cell valid -", get_lattice_matrix(invalid_cell)[1].item())
Cubic cell valid - True
Invalid cell valid - False
The last couple of functions we need include a way to obtain the reciprocal lattice metric tensor; the reciprocal lattice matrix itself can easily be obtained by inverting the lattice matrix we calculated previously.
We also need a way to obtain the reflection d-spacings, and then lastly convert the d-spacings to \(2\theta\).
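The latter is again just Bragg’s law rearranged, \(2\theta = 2\arcsin\left(\frac{\lambda}{2d}\right)\), which is what the d_to_tt function below evaluates (in degrees).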
Show the code
def get_recip_lattice_metric_tensor(recip_lattice_matrix):
    """Calculate the reciprocal lattice metric tensor

    Args:
        recip_lattice_matrix (tensor): Reciprocal lattice matrix
            Shape = (batch, 3, 3)

    Returns:
        tensor: Reciprocal lattice metric tensor
            Shape = (batch, 3, 3)
    """
    return recip_lattice_matrix @ recip_lattice_matrix.permute(0,2,1)

def get_d_spacing(recip_latt_metric_tensor, hkl):
    """Calculate the d-spacings for the reflections from the Miller indices and
    the reciprocal lattice metric tensor

    Args:
        recip_latt_metric_tensor (tensor): Reciprocal lattice metric tensor
            Shape = (batch, 3, 3)
        hkl (tensor): Miller indices
            Shape = (batch, number of reflections, 3)

    Returns:
        tensor: d-spacing for each of the reflections
            Shape = (batch, number of reflections)
    """
    one_over_d_squared = torch.einsum("bij,bji->bi", hkl, torch.einsum(
                            "bij,bkj->bik", recip_latt_metric_tensor, hkl))
    d = 1/torch.sqrt(one_over_d_squared)
    return d

def d_to_tt(d, wavelength=1.54056):
    """Convert d-spacings to twotheta values (in degrees)

    Args:
        d (tensor): d-spacings for each of the reflections
            Shape = (batch, number of reflections)
        wavelength (float): The wavelength of the radiation. Defaults to
            1.54056 which is copper K-a1

    Returns:
        tensor: twotheta values for each of the reflections
            Shape = (batch, number of reflections)
    """
    two_times_180_over_pi = 114.59155902616465
    tt = two_times_180_over_pi*torch.arcsin(wavelength/(2*d))
    return tt

def get_zero_point_error(batchsize, device, dtype, zpemin=-0.03, zpemax=0.03):
    """
    Generate a random zero-point error to be applied to the peak positions

    Args:
        batchsize (int): Size of the batch dimension
        device (torch.device): Device to generate the tensor on
        dtype (torch.dtype): dtype to use for the ZPEs
        zpemin (float, optional): Lower bound for zero point error in degrees
        zpemax (float, optional): Upper bound for zero point error in degrees

    Returns:
        tensor: Zero point error to be applied to the peak positions
            Shape = (batch, 1)
    """
    zero_point_error = (torch.rand((batchsize,1), device=device, dtype=dtype)
                        * (zpemax - zpemin)) + zpemin
    return zero_point_error
To test the whole set of functions written so far, I’ll add a convenience function to calculate the peak positions from the unit cells and Miller indices.
Show the code
def get_peak_positions(unit_cells, hkl, zpe=True, wavelength=1.54056):
    lattice_matrix, valid = get_lattice_matrix(unit_cells)
    unit_cells = unit_cells[valid]
    hkl = hkl[valid]
    reciprocal_lattice_matrix = torch.linalg.inv(lattice_matrix)
    reciprocal_lattice_metric_tensor = get_recip_lattice_metric_tensor(reciprocal_lattice_matrix)
    d_spacing = get_d_spacing(reciprocal_lattice_metric_tensor, hkl)
    twotheta = d_to_tt(d_spacing, wavelength)
    if zpe:
        zero_point_error = get_zero_point_error(twotheta.shape[0],
                                                twotheta.device,
                                                twotheta.dtype)
        twotheta += zero_point_error
    return twotheta
I’ll test that using some diffraction data for a Carbamazepine:Indomethacin cocrystal I published a while ago. The first 3 peaks have the following Miller indices and positions:
\(h\) | \(k\) | \(l\) | \(2\theta\) |
---|---|---|---|
1 | 1 | 0 | 5.322 |
2 | 0 | 0 | 7.496 |
0 | 2 | 0 | 7.562 |
And the unit cell is:
\(a\) = 23.573194, \(b\) = 23.363375, \(c\) = 5.125218
\(\alpha\) = 90.000, \(\beta\) = 88.77132, \(\gamma\) = 90.000
We’ll run the function both with and without zero point errors applied.
Show the code
coXtal_unit_cell = torch.tensor([23.573194, 23.363375, 5.125218, 90.000, 88.77132, 90.000])
coXtal_hkl = torch.tensor([[1,1,0],[2,0,0],[0,2,0]]).float()

for zpe in [False, True]:
    print("ZPE:", zpe)
    coXtal_twotheta = get_peak_positions(coXtal_unit_cell.reshape(1,6), coXtal_hkl.reshape(1,3,3), zpe=zpe)
    print("[h, k, l] 2θ")
    print("----------------")
    for i, ht in enumerate(zip(coXtal_hkl, coXtal_twotheta.squeeze())):
        print(ht[0].int().tolist(), f"{ht[1].item():.3f}")
    print("")
ZPE: False
[h, k, l] 2θ
----------------
[1, 1, 0] 5.322
[2, 0, 0] 7.496
[0, 2, 0] 7.562
ZPE: True
[h, k, l] 2θ
----------------
[1, 1, 0] 5.352
[2, 0, 0] 7.526
[0, 2, 0] 7.592
That looks like it’s working nicely - without a zero point error, the peak positions match, whilst with an error, the peak positions are all shifted by a consistent amount.
Peak intensities
The main contribution to the intensity of a diffraction peak is the position of the atoms within the unit cell. What makes the PowCod database so helpful is that this has already been calculated, and can be read in from the intensities array we saved earlier.
However, there are experimental/physical effects that can modulate these intensities. Most important is the effect of preferred orientation. There are several ways to model this; I’m choosing to implement the March-Dollase method, as I already have some code written to do this in GALLOP. This code is actually based on code from GSAS-II; see the GetPrefOri function in the file found here. I updated it to use PyTorch and to work over a batch of multiple PXRD patterns.
Again, this is being implemented as a form of data augmentation, allowing our existing data to generate multiple different, but plausible, diffraction patterns. As such, the preferred orientation axis is limited to being in the [1,0,0], [0,1,0] or [0,0,1] directions.
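For reference, the March-Dollase correction multiplies each reflection intensity by
\(A_{hkl} = \left(r^{2}\cos^{2}\alpha_{hkl} + \frac{\sin^{2}\alpha_{hkl}}{r}\right)^{-3/2}\)
where \(r\) is the March-Dollase factor (\(r = 1\) meaning no preferred orientation) and \(\alpha_{hkl}\) is the angle between the reflection vector and the preferred orientation axis; this is the quantity assembled from cosP, sinP and factor in the code below.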
Future work could extend this to use the more sophisticated spherical harmonic approach described here.
Show the code
def get_MD_PO_components(hkl, reciprocal_lattice_metric_tensor, dspacing, factor_std=0.1):
    """Calculate the terms needed in the March-Dollase preferred orientation
    correction.

    Args:
        hkl (tensor): The Miller indices of the reflections.
            Shape = (batch, number of reflections, 3)
        reciprocal_lattice_metric_tensor (tensor): The reciprocal lattice metric tensor
            Shape = (batch, 3, 3)
        dspacing (tensor): The d-spacings of the reflections.
            Shape = (batch, number of reflections)
        factor_std (float, optional): The standard deviation for the normally
            distributed March-Dollase factors.
            Defaults to 0.1.

    Returns:
        tuple: Tuple of tensors containing the terms needed in the March-Dollase
        preferred orientation correction function. cosP, sinP = cosine and sine
        of the angle between the Miller indices and the PO axis. factor = the
        March-Dollase factors, PO_axis = the preferred orientation axis.
    """
    batchsize = hkl.shape[0]
    device = hkl.device
    dtype = hkl.dtype
    # Randomly assign the PO axis to be either [1,0,0], [0,1,0] or [0,0,1]
    PO_axis = torch.zeros((batchsize, 3), device=device, dtype=dtype)
    PO_axis_select = torch.randint(0, 3, (batchsize,), device=device)
    PO_axis[torch.arange(batchsize, device=device), PO_axis_select] = 1.0

    """u = hkl / torch.sqrt(torch.einsum("bkj,bkj->bk", hkl,
                        torch.einsum("bij,bkj->bki",
                        reciprocal_lattice_matrix,hkl))).unsqueeze(2)"""
    # Dividing the Miller indices by the reciprocal lattice vector lengths is
    # equivalent to multiplying them by the d-spacings
    u = hkl * dspacing.unsqueeze(2)

    cosP = torch.einsum("bij,bj->bi", u, torch.einsum("bij,bj->bi",
                        reciprocal_lattice_metric_tensor, PO_axis))
    one_minus_cosP_sqd = 1.0 - cosP**2
    one_minus_cosP_sqd[one_minus_cosP_sqd < 0.] *= 0.
    sinP = torch.sqrt(one_minus_cosP_sqd)

    # MD factor = 1 means no PO. Use a normal distribution with std given in the
    # argument centred at 1.
    factor = 1 + (torch.randn((batchsize,1), device=device, dtype=dtype) * factor_std)

    return cosP, sinP, factor, PO_axis

def apply_MD_PO_correction(intensities, cosP, sinP, factor):
    """Modifies the intensities to account for preferred orientation effects
    using the method of March and Dollase.

    Args:
        intensities (tensor): Original intensities for the reflections.
            Shape = (batch, number of reflections)
        cosP (tensor): Cosine of the angle between the Miller indices and the
            preferred orientation axis. Calculated in get_MD_PO_components
            Shape = (batch, number of reflections)
        sinP (tensor): Sine of the angle between the Miller indices and the
            preferred orientation axis. Calculated in get_MD_PO_components
            Shape = (batch, number of reflections)
        factor (tensor): The March-Dollase factors. Shape = (batch, 1)

    Returns:
        tensor: Modified intensities for the reflections given preferred
        orientation. Shape = (batch, number of reflections)
    """
    A_all = (1.0/torch.sqrt(((factor)*cosP)**2 + sinP**2/(factor)))**3
    return intensities * A_all
Peak shapes
Now that we can calculate where the peaks will be, and modulate the intensity of the peaks if we want to, we now need to think about the shape of the peaks.
There are several possible peak shapes that we could go for; however, in order to get relatively realistic results, I decided to implement full Voigt functions, and also to include the Finger, Cox and Jephcoat correction for peak asymmetry due to axial divergence. I also decided to implement peak broadening as a function of the peak position, using the well-established U, V, W, X, Y, Z parameters described here and elsewhere.
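In this parameterisation the Gaussian and Lorentzian half-widths (HWHM) vary with the peak position as
\(\Gamma_{G} = \sqrt{U\tan^{2}\theta + V\tan\theta + W + \frac{Z}{\cos^{2}\theta}}\) and \(\Gamma_{L} = X\tan\theta + \frac{Y}{\cos\theta}\)
which is what the get_hwhm_G and get_hwhm_L functions below evaluate (using the peak \(2\theta\) values as the angle argument).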
We therefore need to calculate gaussian, lorentzian and the FCJ profile for each peak in the data, then convolve them together. Using the convolution theorem, we can do this by first Fourier transforming them, calculating their pointwise product in Fourier space, then inverse Fourier transforming the resultant product.
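In other words, each peak profile is computed as
\(V(x) = \mathcal{F}^{-1}\left[\mathcal{F}(G)\cdot\mathcal{F}(L)\cdot\mathcal{F}(F_{\mathrm{CJ}})\right]\)
with the three component profiles sampled on a common grid; this is how the calculate_peaks function assembles the asymmetric Voigt peaks later on.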
Show the code
def gaussian(x, mu, sig):
    """Calculate a gaussian peak

    Args:
        x (tensor): the x-coordinates for the peak. Shape = (datapoints)
        mu (tensor): the mean of the gaussian.
            Shape = (batch, number of peaks, 1)
        sig (tensor): the standard deviation of the gaussian.
            Shape = (batch, number of peaks, 1)

    Returns:
        tensor: the gaussian peaks centered at mu with standard deviation sig.
            Intensities scaled to max = 1.
            Shape = (batch, number of peaks, datapoints)
    """
    root_two_pi = 2.5066282746310002
    peak = (1/(sig*root_two_pi))*torch.exp(-0.5*(x-mu)**2/sig**2)
    return peak/peak.max(dim=-1).values.unsqueeze(2)

def lorentzian(x, loc, gam):
    """Calculate a lorentzian peak

    Args:
        x (tensor): the x-coordinates for the peak. Shape = (datapoints)
        loc (tensor): The centre of the peaks.
            Shape = (batch, number of peaks, 1)
        gam (tensor): The half width at half max for the peaks.
            Shape = (batch, number of peaks, 1)

    Returns:
        tensor: the lorentzian peaks centered at loc with HWHM gam.
            Intensities scaled to max = 1.
            Shape = (batch, number of peaks, datapoints)
    """
    one_over_pi = 0.3183098861837907
    peak = one_over_pi*(gam/((x - loc)**2 + gam**2))
    return peak/peak.max(dim=-1).values.unsqueeze(2)

def fcj(data, twotheta, shl):
    """Finger, Cox and Jephcoat profile function.
    Code based on GSAS-II function found here:
    https://gsas-ii.readthedocs.io/en/latest/_modules/GSASIIpwd.html#fcjde_gen

    Args:
        data (tensor): the data to evaluate the function on.
            Shape = (batch, number of peaks, datapoints)
        twotheta (tensor): The position of the peak.
            Shape = (batch, number of peaks, 1)
        shl (tensor): shl = sum(S/L,H/L) where:
            S: sample height
            H: detector opening
            L: sample to detector opening distance
            Shape = (batch, 1)

    Returns:
        tensor: The Finger, Cox and Jephcoat profiles for the peaks
            Shape = (batch, number of peaks, datapoints)
    """
    pi_over_180 = 0.017453292519943295
    step = data[1] - data[0]
    T = step*data + twotheta
    abs_cos_T = torch.abs(torch.cos(T*pi_over_180))
    abs_cos_T_sqd = abs_cos_T**2
    cos_sqd_twotheta = torch.cos(twotheta*pi_over_180)**2
    cos_sqd_twotheta = torch.where(abs_cos_T_sqd > cos_sqd_twotheta,
                                   cos_sqd_twotheta, abs_cos_T_sqd)
    fcj_profile = torch.where(abs_cos_T_sqd > cos_sqd_twotheta,
                              (torch.sqrt(cos_sqd_twotheta
                              /(abs_cos_T_sqd-cos_sqd_twotheta+1e-9))
                              -1./shl)/abs_cos_T, 0.0)
    fcj_profile = torch.where(fcj_profile > 0., fcj_profile, 0.)
    # Sometimes the FCJ profile returned is all zeros. We don't want to separate
    # these out with loops etc, so need to deal with them. We'll replace zero entries
    # with an array like this ([1, 0, 0, ..., 0]) which gives ([1, 1, 1, ..., 1])
    # after the Fourier transform. This then means that the pointwise multiplication
    # of the FT Gaussian and Lorentzian components can still occur unaffected
    zero_sum = (fcj_profile.sum(dim=-1) == 0).type(fcj_profile.dtype)
    fcj_profile[:,:,0] += zero_sum
    #first_zero = torch.zeros_like(x)
    #first_zero[0] = 1
    #batch_first_zero = (zero_sum * first_zero.unsqueeze(0))
    #peak_FCJ += batch_first_zero
    return fcj_profile

def get_UVWZ(batchsize, device, dtype, U_min=0.0001, U_max=0.0004,
             V_min=0.0001, V_max=0.0004, W_min=0.0001, W_max=0.0004,
             Z_min=0.0001, Z_max=0.0004):
    """
    Get parameters for Gaussian HWHM. Defaults should give reasonable data to
    resemble laboratory diffraction data
    """
    U = ((torch.rand(batchsize, device=device, dtype=dtype) * (U_max-U_min))
         + U_min).unsqueeze(1)
    V = ((torch.rand(batchsize, device=device, dtype=dtype) * (V_max-V_min))
         + V_min).unsqueeze(1)
    W = ((torch.rand(batchsize, device=device, dtype=dtype) * (W_max-W_min))
         + W_min).unsqueeze(1)
    Z = ((torch.rand(batchsize, device=device, dtype=dtype) * (Z_max-Z_min))
         + Z_min).unsqueeze(1)
    return U, V, W, Z

def get_XY(batchsize, device, dtype, X_min=0.001, X_max=0.035,
           Y_min=0.001, Y_max=0.035):
    """
    Get parameters for Lorentzian HWHM. Defaults should give reasonable data to
    resemble laboratory diffraction data
    """
    X = ((torch.rand(batchsize, device=device, dtype=dtype) * (X_max-X_min))
         + X_min).unsqueeze(1)
    Y = ((torch.rand(batchsize, device=device, dtype=dtype) * (Y_max-Y_min))
         + Y_min).unsqueeze(1)
    return X, Y

def get_hwhm_G(tan_twotheta, cos_twotheta, U, V, W, Z):
    """Calculate Gaussian HWHM as a function of peak position and U, V, W and Z params

    Args:
        tan_twotheta (tensor): tangent of the twotheta peak positions
        cos_twotheta (tensor): cosine of the twotheta peak positions
        U (tensor): peakshape parameter U
        V (tensor): peakshape parameter V
        W (tensor): peakshape parameter W
        Z (tensor): peakshape parameter Z

    Returns:
        tensor: HWHM for the gaussian peaks
            Shape = (batch, number of peaks)
    """
    tan_twotheta = tan_twotheta.squeeze()
    cos_twotheta = cos_twotheta.squeeze()
    return torch.sqrt((U * tan_twotheta**2) + (V * tan_twotheta)
                      + W + (Z/(cos_twotheta**2))).unsqueeze(2)

def get_hwhm_L(tan_twotheta, cos_twotheta, X, Y):
    """Calculate Lorentzian HWHM as a function of peak position and X and Y params

    Args:
        tan_twotheta (tensor): tangent of the twotheta peak positions
        cos_twotheta (tensor): cosine of the twotheta peak positions
        X (tensor): peakshape parameter X
        Y (tensor): peakshape parameter Y

    Returns:
        tensor: HWHM for the lorentzian peaks
            Shape = (batch, number of peaks)
    """
    tan_twotheta = tan_twotheta.squeeze()
    cos_twotheta = cos_twotheta.squeeze()
    return ((X * tan_twotheta) + (Y/cos_twotheta)).unsqueeze(2)

def get_shl(batchsize, device, dtype, shlmax=0.5, rescale=True):
    """Generate asymmetry parameter for the FCJ profile.

    Args:
        shl = sum(S/L,H/L) where:
            S: sample height
            H: detector opening
            L: sample to detector opening distance
        This is scaled by 1/57.2958 if rescale is True, in keeping with the scaling
        applied in the original GSAS-II code

    Returns:
        tensor: The asymmetry parameter for the FCJ profiles
    """
    shl = torch.rand((batchsize, 1, 1), device=device, dtype=dtype) * shlmax
    if rescale:
        shl /= 57.2958
    return shl
Background and noise
The experimental noise can be modelled as gaussian noise, which we can add very easily to the calculated diffraction data. For the background shape, we can use Chebyshev polynomials.
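Each background is therefore a random weighted sum of Chebyshev polynomials of the first kind, evaluated on the \(2\theta\) axis rescaled to \([-1, 1]\):
\(b(x^{*}) = \sum_{n=0}^{N-1} c_{n}T_{n}(x^{*})\), with \(T_{n}(x^{*}) = \cos(n\arccos x^{*})\)
where the weights \(c_{n}\) are drawn randomly for every pattern, as implemented in get_background below.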
Show the code
def get_noise(calculated_patterns, noise_min=0.0001, noise_max=0.0025):
    """Get noise for the diffraction patterns to simulate experimental data

    Args:
        calculated_patterns (tensor): The diffraction patterns.
            Shape = (batch, points in profile)
        noise_min (float, optional): Minimum standard deviation for the
            Gaussian noise. Defaults to 0.0001.
        noise_max (float, optional): Maximum standard deviation for the
            Gaussian noise. Defaults to 0.0025.

    Returns:
        tensor: Noise to be added to the diffraction patterns
    """
    batchsize = calculated_patterns.shape[0]
    device = calculated_patterns.device
    dtype = calculated_patterns.dtype
    noise_std = torch.rand((batchsize,1), device=device, dtype=dtype) * (noise_max - noise_min) + noise_min
    noise = torch.randn(calculated_patterns.shape, device=device, dtype=dtype) * noise_std
    return noise

def get_background(batchsize, data, bg_prm_max=0.025, bg_prm_min=0.0, degree=8):
    """Calculate background profiles using Chebyshev polynomials

    Args:
        batchsize (int): The batch size
        data (tensor): The twotheta values for the diffraction histograms
        bg_prm_max (float, optional): Maximum value for the standard deviation
            of the weights for the Chebyshev polynomial components. Defaults to
            0.025.
        bg_prm_min (float, optional): Minimum value for the standard deviation
            of the weights for the Chebyshev polynomial components. Defaults to
            0.0.
        degree (int, optional): The degree of Chebyshev polynomial to use for
            the backgrounds. Defaults to 8.

    Returns:
        tensor: Background profiles. Shape = (batch, number of points in histogram)
    """
    device = data.device
    dtype = data.dtype
    n = torch.arange(degree, device=device, dtype=dtype).unsqueeze(1)
    # Scale data into range -1 to +1
    ttstar = 2*(data - data.min())/(data.max() - data.min()) - 1
    chebyshev_basis = torch.cos(n*torch.arccos(ttstar))
    params = (((torch.rand((batchsize,1,1), device=device, dtype=dtype)
                * (bg_prm_max - bg_prm_min)) + bg_prm_min)
              * torch.randn((batchsize, chebyshev_basis.shape[0], 1),
                            device=device, dtype=dtype))
    bg = (params * chebyshev_basis).sum(dim=1)
    return bg
A quick test to see if these functions work and generate sensible looking noise and backgrounds:
Show the code
import matplotlib.pyplot as plt
data = torch.linspace(4, 44, 2048).float()

batchsize = 4
bgs = get_background(batchsize, data, degree=10)
bgs -= bgs.min(dim=-1).values.unsqueeze(1)

noise = get_noise(bgs)

fig, ax = plt.subplots(1, 1, figsize=(10,6))
for bg in (bgs+noise):
    ax.plot(data, bg)
ax.set_xlabel("$2\\theta$")
ax.set_title("Randomly generated backgrounds with noise")
plt.show()
Some funky background shapes are possible with randomly generated Chebyshev polynomials! We can also see varying levels of noise being added to the profiles.
We now need functions to tie everything together and start to generate some data.
Show the code
def get_peak_positions(crystal_systems, hkl, intensities, unit_cells,
                       perturbation_stddev=0.05, zpemin=-0.03, zpemax=0.03, wavelength=1.54056):
    batchsize = intensities.shape[0]
    dtype = intensities.dtype
    device = intensities.device

    cell_perturbation = get_unit_cell_perturbation(crystal_systems,
                                dtype=dtype, stddev=perturbation_stddev)
    new_unit_cells = unit_cells + cell_perturbation
    lattice_matrix, valid = get_lattice_matrix(new_unit_cells)

    # Get rid of any invalid unit cells after perturbation
    if valid.sum() != valid.shape[0]:
        import warnings
        warnings.warn("Invalid cells generated")
    lattice_matrix = lattice_matrix[valid]
    hkl = hkl[valid]
    intensities = intensities[valid]
    batchsize = intensities.shape[0]
    reciprocal_lattice_matrix = torch.linalg.inv(lattice_matrix)
    reciprocal_lattice_metric_tensor = get_recip_lattice_metric_tensor(reciprocal_lattice_matrix)
    d_spacing = get_d_spacing(reciprocal_lattice_metric_tensor, hkl)
    zpe = get_zero_point_error(batchsize, device, dtype, zpemin=zpemin, zpemax=zpemax)
    twotheta = zpe + d_to_tt(d_spacing, wavelength)

    return twotheta, reciprocal_lattice_metric_tensor, hkl, intensities, d_spacing

def get_PO_intensities(hkl, reciprocal_lattice_metric_tensor, dspacing, intensities, PO_std=0.1):
    # Now apply PO perturbation to the peak intensities
    cosP, sinP, MDfactor, PO_axis = get_MD_PO_components(hkl,
                    reciprocal_lattice_metric_tensor, dspacing, factor_std=PO_std)
    intensities = apply_MD_PO_correction(intensities, cosP, sinP, MDfactor)
    return torch.nan_to_num(intensities)

def get_peak_shape_params(twotheta, U_min=0.0001, U_max=0.0004,
                          V_min=0.0001, V_max=0.0004, W_min=0.0001, W_max=0.0004,
                          Z_min=0.0001, Z_max=0.0004, X_min=0.001, X_max=0.035,
                          Y_min=0.001, Y_max=0.035, shlmax=0.5):
    batchsize = twotheta.shape[0]
    dtype = twotheta.dtype
    device = twotheta.device
    tan_twotheta = torch.tan(twotheta*torch.pi/180.)
    cos_twotheta = torch.cos(twotheta*torch.pi/180.)
    U, V, W, Z = get_UVWZ(batchsize, device, dtype, U_min=U_min, U_max=U_max,
                          V_min=V_min, V_max=V_max, W_min=W_min, W_max=W_max,
                          Z_min=Z_min, Z_max=Z_max)
    X, Y = get_XY(batchsize, device, dtype, X_min=X_min, X_max=X_max,
                  Y_min=Y_min, Y_max=Y_max)
    hwhm_gaussian = get_hwhm_G(tan_twotheta, cos_twotheta, U, V, W, Z)
    hwhm_lorentzian = get_hwhm_L(tan_twotheta, cos_twotheta, X, Y)
    shl = get_shl(batchsize, device, dtype, shlmax=shlmax, rescale=True)
    return hwhm_gaussian, hwhm_lorentzian, shl

def calculate_peaks(x, twotheta, intensities, hwhm_gaussian, hwhm_lorentzian, shl):
    peak_G = gaussian(x, torch.zeros_like(twotheta), hwhm_gaussian)
    peak_L = lorentzian(x, torch.zeros_like(twotheta), hwhm_lorentzian)
    peak_FCJ = fcj(x, twotheta, shl)
    peak_GLF = torch.stack([peak_G, peak_L, peak_FCJ], dim=1)
    prod_FT_GLF = torch.fft.fft(peak_GLF).prod(dim=1)
    peak_voigt = torch.fft.ifft(prod_FT_GLF).real
    zero_sum = peak_FCJ.sum(dim=-1) == 1
    peak_voigt[zero_sum.squeeze()] = torch.fft.ifftshift(peak_voigt[zero_sum.squeeze()], dim=-1)
    peak_voigt /= peak_voigt.max(dim=2).values.unsqueeze(2)
    peak_voigt *= intensities
    return peak_voigt

def calculate_full_patterns(x, full_data, twotheta, peak_voigt, ttmin=4., ttmax=44.):
    # Finally calculate the full diffraction pattern
    device = peak_voigt.device
    dtype = peak_voigt.dtype
    twotheta[twotheta == 0] = torch.inf
    twotheta[twotheta < ttmin] = torch.inf
    twotheta[twotheta > ttmax] = torch.inf
    peakidx = torch.abs((x[0] + twotheta) - full_data).min(dim=-1).indices
    full_pattern = torch.zeros(list(peakidx.shape)+[full_data.shape[0]], device=device, dtype=dtype)
    full_pattern = full_pattern.scatter_(2,
        peakidx.unsqueeze(2) + torch.arange(x.shape[0], device=device),
        peak_voigt*torch.isfinite(twotheta))

    full_pattern = full_pattern.sum(dim=1)
    full_pattern /= full_pattern.max(dim=1).values.unsqueeze(1)
    full_pattern = full_pattern[:,(full_data >= ttmin) & (full_data <= ttmax)]
    return full_pattern

def calculate_diffraction_patterns(x, full_data, crystal_systems, hkl, intensities, unit_cells, wavelength=1.54056):
    """
    Expect the input tensors to have their first dimension to be of size batchsize
    """
    twotheta, reciprocal_lattice_metric_tensor, hkl, intensities, d_spacing = get_peak_positions(
        crystal_systems, hkl, intensities, unit_cells,
        perturbation_stddev=0.05, zpemin=-0.03, zpemax=0.03, wavelength=wavelength)

    twotheta = twotheta.unsqueeze(2)

    mod_intensities = get_PO_intensities(hkl, reciprocal_lattice_metric_tensor, d_spacing, intensities).unsqueeze(2)

    hwhm_gaussian, hwhm_lorentzian, shl = get_peak_shape_params(twotheta)

    peak_voigt = calculate_peaks(x, twotheta, mod_intensities, hwhm_gaussian, hwhm_lorentzian, shl)

    calculated_patterns = calculate_full_patterns(x, full_data, twotheta, peak_voigt, ttmin=ttmin, ttmax=ttmax)

    bgs = get_background(calculated_patterns.shape[0],
                         full_data[(full_data >= ttmin) & (full_data <= ttmax)],
                         degree=10)
    noise = get_noise(calculated_patterns)

    calculated_patterns += bgs + noise
    calculated_patterns -= calculated_patterns.min(dim=1).values.unsqueeze(1)
    calculated_patterns /= calculated_patterns.max(dim=1).values.unsqueeze(1)

    return calculated_patterns
Let’s test the functions we’ve written using some real diffraction data as a comparison point. We’ll first plot the actual diffraction pattern, then plot several simulated datasets calculated from the unit cell, hkls and intensities extracted by DASH. This will show the effect of the data augmentation we’ve added into the code which modulates the positions, intensities and shapes of the peaks.
The dataset we’ll look at is that for verapamil hydrochloride. We’ll need a function to calculate the Lorentz-polarisation factor for the intensities, as this would have been corrected for in the intensities in the DASH hcv.
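The form implemented in lp_factor below, for a monochromator angle \(2\theta_{m}\) (28° by default), is
\(Lp(2\theta) = \frac{1 + \cos^{2}(2\theta_{m})\cos^{2}(2\theta)}{\left(1 + \cos^{2}(2\theta_{m})\right)\sin(2\theta)}\)
and the calculated patterns are multiplied by it before comparison, since this factor has already been divided out of the intensities stored in the hcv file.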
Show the code
v_xye = []
with open("./files/Verap.xye", "r") as f:
    for line in f:
        line = list(filter(None, line.strip().split(" ")))
        v_xye.append([float(x) for x in line])
v_xye = np.array(v_xye)
v_tt, v_i_obs = v_xye[:,0], v_xye[:,1]
v_i_obs /= v_i_obs.max()
Show the code
def lp_factor(twotheta, ttmono=torch.tensor([28])):
    A = torch.cos(ttmono*torch.pi/180)**2
    cossqdtt = torch.cos(twotheta*torch.pi/180)**2
    sintt = torch.sin(twotheta*torch.pi/180)
    return (1+A*cossqdtt) / ((1+A)*sintt)
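Written out explicitly, the correction applied by lp_factor is (with the monochromator angle $2\theta_{\mathrm{m}}$ defaulting to 28°):

$$
\mathrm{LP}(2\theta) = \frac{1 + \cos^{2}(2\theta_{\mathrm{m}})\,\cos^{2}(2\theta)}{\left[1 + \cos^{2}(2\theta_{\mathrm{m}})\right]\sin(2\theta)}
$$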
Show the code
import matplotlib.pyplot as plt

device = torch.device("cpu")
dtype = torch.float32
ttmin = 4
ttmax = 44
peakrange = 3.
datadim = 2048
full_data = torch.linspace(ttmin-(peakrange/2), ttmax+(peakrange/2), int(np.ceil((ttmax-ttmin+peakrange)/((ttmax-ttmin)/datadim))), device=device, dtype=dtype)
plotdata = full_data[(full_data >= ttmin) & (full_data <= ttmax)].cpu()
x = (full_data[full_data <= ttmin+(peakrange/2)]).clone() - 4
v_hkl = []
v_i = []
with open("./files/Verap.hcv", "r") as f:
    for line in f:
        line = list(filter(None, line.strip().split(" ")))
        v_hkl.append([int(x) for x in line[0:3]])
        v_i.append(float(line[3]))
v_hkl = torch.tensor(v_hkl).float()
v_i = torch.tensor(v_i).float()
v_cell = torch.tensor([7.08991, 10.59464, 19.20684, 100.1068, 93.7396, 101.5610])
v_cs = torch.tensor([cs_key["Monoclinic"]])

repeat = 4
diffraction_patterns = calculate_diffraction_patterns(x, full_data, v_cs.repeat(repeat), v_hkl.repeat(repeat,1,1), v_i.repeat(repeat,1), v_cell.repeat(repeat,1))

plt.plot(v_tt[v_tt <= 35], v_i_obs[v_tt <= 35])
plt.title("Original verapamil hydrochloride diffraction data")
plt.xlabel("$2\\theta$")
plt.ylabel("Intensity")
plt.show()

fig, ax = plt.subplots(2,2,figsize=(16,8))
for i, p in enumerate(diffraction_patterns):
    col = i % 2
    row = i // 2
    corrected_i = (lp_factor(plotdata[plotdata <= 35])*p[plotdata <= 35]).squeeze()
    corrected_i -= corrected_i.min()
    corrected_i /= corrected_i.max()
    ax[row][col].plot(plotdata[plotdata <= 35], corrected_i)
    ax[row][col].set_xlabel("$2\\theta$")
    ax[row][col].set_ylabel("Intensity")
fig.suptitle("Calculated diffraction patterns")
plt.show()
The calculated patterns clearly resemble the observed diffraction data, whilst also showing subtle differences in the background, noise level, peak shapes and asymmetries, as well as in the peak positions and intensities.
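For a more direct comparison, one of the simulated patterns can be overlaid on the observed data, reusing the same variables and the same Lorentz-polarisation correction as above - a quick optional sketch:

```python
# Optional: overlay the first simulated pattern on the observed verapamil data,
# applying the same LP correction and normalisation as in the plotting loop above.
p = diffraction_patterns[0]
corrected = (lp_factor(plotdata[plotdata <= 35]) * p[plotdata <= 35]).squeeze()
corrected = (corrected - corrected.min()) / (corrected.max() - corrected.min())
plt.plot(v_tt[v_tt <= 35], v_i_obs[v_tt <= 35], label="Observed")
plt.plot(plotdata[plotdata <= 35], corrected, label="Simulated", alpha=0.7)
plt.xlabel("$2\\theta$")
plt.ylabel("Intensity")
plt.legend()
plt.show()
```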
Performance
Let’s test the performance of the generation code on an Nvidia Quadro RTX 5000 GPU (16 GB VRAM). We’ve got about 288,000 patterns to generate - how long will it take?!
Show the code
= "./"
data_dir = "4-44-CuKa1-data_4000_"
base_name = torch.tensor(np.load(data_dir+base_name+"crystal_systems_numeric.npy"))
crystal_systems = torch.tensor(np.load(data_dir+base_name+"hkl.npy")).float()
hkl = torch.tensor(np.load(data_dir+base_name+"intensities.npy")).float()
intensities = torch.tensor(np.load(data_dir+base_name+"unit_cell.npy")).float()
unit_cell = np.load(data_dir+base_name+"cod_id.npy") cod_id
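Before running the full benchmark, a quick optional check that everything shares the same leading dimension, since calculate_diffraction_patterns expects the first dimension of each input to be the batch size:

```python
# Optional sanity check: the first dimension of each array should be the
# number of database entries and should match across all of the inputs.
print(crystal_systems.shape, hkl.shape, intensities.shape, unit_cell.shape, cod_id.shape)
```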
Show the code
from tqdm import tqdm

device = torch.device("cuda:0")
dtype = torch.float32
ttmin = 4
ttmax = 44
peakrange = 3.
datadim = 2048
full_data = torch.linspace(ttmin-(peakrange/2), ttmax+(peakrange/2), int(np.ceil((ttmax-ttmin+peakrange)/((ttmax-ttmin)/datadim))), device=device, dtype=dtype)
plotdata = full_data[(full_data >= ttmin) & (full_data <= ttmax)]
x = (full_data[full_data <= ttmin+(peakrange/2)]).clone() - ttmin

batchsize = 1000
idx = torch.arange(batchsize)

for i in tqdm(range(intensities.shape[0]//batchsize)):
    batch_idx = idx + (i*batchsize)
    diffraction_patterns = calculate_diffraction_patterns(
        x,
        full_data,
        crystal_systems[batch_idx].to(device),
        hkl[batch_idx].to(device),
        intensities[batch_idx].to(device),
        unit_cell[batch_idx].to(device)
    )
100%|██████████| 288/288 [00:57<00:00, 4.97it/s]
288k patterns generated in about a minute - 288 batches of 1,000 patterns in roughly 58 seconds works out at about 5,000 diffraction patterns per second!
I’m sure there are some tweaks that would boost the performance even more, but for now this will be good enough for initial experiments with neural networks. A quick check with a profiler could give some hints as to where the biggest bottlenecks in the code are:
Show the code
import cProfile as profile
import pstats

prof = profile.Profile()
prof.enable()
diffraction_patterns = calculate_diffraction_patterns(
    x,
    full_data,
    crystal_systems[batch_idx].to(device),
    hkl[batch_idx].to(device),
    intensities[batch_idx].to(device),
    unit_cell[batch_idx].to(device)
)
prof.disable()
stats = pstats.Stats(prof).strip_dirs().sort_stats("tottime")
stats.print_stats(10) # top 10 rows
294 function calls in 0.422 seconds
Ordered by: internal time
List reduced from 94 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.297 0.297 0.298 0.298 118587661.py:54(calculate_peaks)
1 0.086 0.086 0.086 0.086 118587661.py:67(calculate_full_patterns)
4 0.016 0.004 0.016 0.004 {method 'to' of 'torch._C._TensorBase' objects}
1 0.015 0.015 0.016 0.016 1250221022.py:1(get_MD_PO_components)
1 0.002 0.002 0.422 0.422 3294103451.py:6(<module>)
1 0.002 0.002 0.004 0.004 118587661.py:1(get_peak_positions)
1 0.000 0.000 0.000 0.000 {built-in method torch._C._linalg.linalg_inv}
1 0.000 0.000 0.001 0.001 895306903.py:3(get_unit_cell_perturbation)
1 0.000 0.000 0.405 0.405 118587661.py:82(calculate_diffraction_patterns)
1 0.000 0.000 0.001 0.001 3032108489.py:41(fcj)
<pstats.Stats at 0x1b7da6a57c0>
The filenames look a bit odd as I’m running this in a notebook. However, it looks like the peak calculation is taking the most time, which isn’t too surprising to me as there’s a lot going on there! Perhaps in future work I’ll try to improve the efficiency of the code. For now though, I think it’s good enough for us to start experimenting with training neural networks, which will be the subject of future posts.
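One caveat worth noting: cProfile only measures host-side Python time, and CUDA kernels are launched asynchronously, so the per-function numbers above may not tell the whole story on the GPU. A minimal sketch using torch.profiler (reusing the batch tensors from the benchmark above) would capture the device-side kernel times as well:

```python
from torch.profiler import profile as torch_profile, ProfilerActivity

# Profile one batch on both the CPU and the GPU to see which kernels dominate
with torch_profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as p:
    calculate_diffraction_patterns(
        x,
        full_data,
        crystal_systems[batch_idx].to(device),
        hkl[batch_idx].to(device),
        intensities[batch_idx].to(device),
        unit_cell[batch_idx].to(device)
    )
print(p.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```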
PowCod-generated data
As a final sense check, let’s look at some of the data we can generate from the PowCod database.
Show the code
idx = torch.arange(4)
diffraction_patterns = calculate_diffraction_patterns(
    x,
    full_data,
    crystal_systems[idx].to(device),
    hkl[idx].to(device),
    intensities[idx].to(device),
    unit_cell[idx].to(device)
)
fig, ax = plt.subplots(2,2,figsize=(16,10))
for i, d in enumerate(diffraction_patterns):
    col = i % 2
    row = i // 2
    ax[row][col].plot(plotdata.cpu().numpy(), d.cpu().numpy())
    ax[row][col].set_xlabel("$2\\theta$")
    ax[row][col].set_ylabel("Intensity")
    ax[row][col].set_title(f"{i+1} - COD entry {cod_id[idx[i]]}")
    print(f"{i+1} - http://www.crystallography.net/cod/{cod_id[idx[i]]}.html")
plt.show()
1 - http://www.crystallography.net/cod/7220223.html
2 - http://www.crystallography.net/cod/1553394.html
3 - http://www.crystallography.net/cod/7037807.html
4 - http://www.crystallography.net/cod/7036591.html
I downloaded the CIFs for each of the COD entries, and calculated their diffraction patterns using Mercury. I am very pleased that the patterns produced by the code in this post match the patterns produced by Mercury!
Conclusions
In this post, we’ve developed some code to allow us to read in unit cells, Miller indices, crystal system information and intensities, and rapidly generate realistic-looking powder X-ray diffraction data on a GPU.
I’ve put all of the functions written in this post into a GitHub repository, which can be found here - https://github.com/mspillman/powcodgen/ - please feel free to submit issues, pull requests etc. if you have any ideas or suggestions for improvement!
In future posts, I’ll show how such data can be used to train artificial neural networks to do lots of interesting and useful things to help crystallographers understand their data.