Generating synthetic diffraction data from PowCod database

Generating synthetic powder X-ray diffraction data for machine learning applications
PXRD, Machine learning, Synthetic data

Author: Mark Spillman

Published: September 8, 2023

Introduction

Neural networks have become a powerful tool for automating complex tasks in crystallography, including PXRD data analysis: for example, indexing, crystallite size and scale factor determination, and crystal system and space group classification. There are many more examples in the literature! In future posts, we’ll look at how to train our own neural networks for analysing PXRD data. However, before we get there, we’ll need a large volume of PXRD data against which they can be trained.

To the best of my knowledge, there is no extant database of experimental PXRD patterns. However, there are several databases, such as the COD, the CCDC’s CSD and the ICDD’s PDF, which provide experimental crystal structures. These databases, which contain hundreds of thousands of crystal structures, can be used to calculate diffraction patterns. The CSD has a Python API which allows PXRD data to be generated for each of the entries in the database. However, it is not optimised to run quickly, and as such, the PXRD data would need to be generated in advance rather than on-the-fly.

In this post, we’ll look at how we can efficiently calculate large numbers of diffraction patterns on-the-fly whilst training neural networks. By writing our own code to generate synthetic PXRD data, we will also be able to implement various data augmentation transformations. Such transformations make the calculated PXRD data look different from the perspective of a machine learning algorithm, but could still plausibly be produced by the same underlying material. This will help to make our future neural networks more robust and able to be applied to real-world diffraction data.

Synthetic diffraction data using the PowCod database

I used the PowCod database produced by the Bari group as the source for the diffraction data. This database is based on the COD and, for each entry, a powder X-ray diffraction pattern has been calculated. This is saved in PowCod in the form of the Miller indices of each reflection, the associated intensity, the d-spacing and other relevant information (e.g. multiplicity). Being able to read in this information directly, rather than having to calculate it for each of the crystal structures, makes this work significantly easier.

As described earlier, PowCod provides lots of information. Most pertinent for this work are the unit cells, space groups, Miller indices and associated intensities for all of the crystal structures. In the next sections, we’ll look at how I filtered the PowCod database, and then discuss the PyTorch-based code I’ve written to generate relatively realistic looking PXRD data on-the-fly using my GPU.

I’ve included several methods of data augmentation:

  • Modification of unit cells to simulate thermal expansion / contraction
  • Modification of intensities using the March-Dollase method to simulate preferred orientation effects
  • Modification of peak shapes, broadening and asymmetry to simulate different crystallite sizes/strains and instrumentation
  • Application of zero-point errors to simulate different instruments
  • Addition of noise to simulate different experimental data collection times / sample sizes

Filtering PowCod data

The PowCod database, which can be obtained here, is made up of several .sql files which need to be parsed to extract the desired information. In our case, what we are after is:

  • Unit cells
  • Crystal systems
  • Space groups
  • Miller indices
  • Intensities

We’ll probably also need to use some additional information to help us along the way, for example, the d-spacing of the reflections.

As the files total only around 5.5 GB, I used pandas to read the database into memory and extract the information needed. I pickled the resultant dataframe to enable it to be more easily read in the future.

import pandas as pd
import os


def read_powcod_sql():
    import sqlite3
    con = sqlite3.connect('cod2205.sq')
    df = pd.read_sql_query('SELECT * FROM id',con)
    con2 = sqlite3.connect('cod2205.sq.info')
    df2 = pd.read_sql_query('SELECT * FROM info',con2)

    df.id = pd.to_numeric(df.id)
    df2.id = pd.to_numeric(df2.id)

    combined = df.merge(df2, left_on="id", right_on="id")
    return combined

if os.path.isfile("combined.pkl"):
    combined = pd.read_pickle("combined.pkl")
else:
    combined = read_powcod_sql()
    print(combined.columns)
    combined = combined[["spacegroup_x", "a", "b", "c", "alpha", "beta",
                    "gamma", "volume", "h", "k", "l", "nd", "dvalue",
                    "intensita", "type", "id"]]
    combined.to_pickle("combined.pkl")

Let’s have a look at what we’ve got:

Show the code
combined
spacegroup_x a b c alpha beta gamma volume h k l nd dvalue intensita type id
0 C m c m 0.0000 0.0000 0.0000 0.0 0.000 0.0 738.078 0 0 0 1 0.000000 0.000000 Orthorhombic 1010866
1 I 4/m m m 0.0000 0.0000 0.0000 0.0 0.000 0.0 388.473 0 0 0 1 0.000000 0.000000 Tetragonal 6000382
2 P -1 0.0000 0.0000 0.0000 0.0 0.000 0.0 429.928 0 0 0 1 0.000000 0.000000 Triclinic 6000073
3 P -3 0.0000 0.0000 0.0000 0.0 0.000 0.0 1221.297 0 0 0 1 0.000000 0.000000 Trigonal (hexagonal axes) 6000152
4 P 1 0.0000 0.0000 0.0000 0.0 0.000 0.0 552.983 0 0 0 1 0.000000 0.000000 Cubic 6000430
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
458155 P 1 21 1 6.6330 15.4171 11.9417 90.0 103.335 90.0 1188.257 0,0,0,1,0,1,1,1,0,0,1,1,1,1,1,0,1,0,1,1,1,1,1,... 0,1,2,0,2,0,1,1,0,1,0,2,0,1,2,3,1,2,2,2,3,3,0,... 1,1,0,0,1,-1,0,-1,2,2,1,0,-2,1,-1,1,-2,2,1,-2,... 499 11.619700,9.279300,7.708500,6.454200,6.423600,... 1000.000000,62.638200,82.940500,3.929300,0.067... Monoclinic 4344045
458156 C 1 2/m 1 21.2287 17.8117 12.3055 90.0 124.759 90.0 3822.662 1,1,0,2,0,2,0,2,1,3,2,2,1,1,3,3,1,2,4,2,0,4,1,... 1,1,0,0,2,0,2,2,1,1,2,0,3,1,1,1,3,0,0,2,0,0,3,... 0,-1,1,-1,0,0,1,-1,1,-1,0,-2,0,-2,0,-2,-1,1,-1... 498 12.461500,10.123800,10.109700,9.999700,8.90580... 1000.000000,1.459300,384.189200,186.432500,88.... Monoclinic 4123491
458157 C 1 2/m 1 21.5314 18.0366 12.3918 90.0 124.872 90.0 3948.216 1,1,0,2,0,2,0,2,1,3,2,2,1,3,1,3,1,2,4,2,0,4,1,... 1,1,0,0,2,0,2,2,1,1,2,0,3,1,1,1,3,0,0,2,0,0,3,... 0,-1,1,-1,0,0,1,-1,1,-1,0,-2,0,0,-2,-2,-1,1,-1... 498 12.620300,10.213500,10.166600,10.122700,9.0183... 1000.000000,0.425500,401.199700,131.443500,109... Monoclinic 4123492
458158 I -4 2 m 10.2400 10.2400 9.6520 90.0 90.000 90.0 1012.086 1,1,2,0,2,1,2,2,3,3,1,2,3,3,2,4,3,0,4,3,4,1,4,... 1,0,0,0,1,1,2,0,1,0,0,2,2,1,1,0,3,0,1,0,2,1,0,... 0,1,0,2,1,2,0,2,0,1,3,2,1,2,3,0,0,4,1,3,0,4,2,... 302 7.240800,7.023700,5.120000,4.826000,4.137400,4... 435.793700,1000.000000,156.773800,74.174400,1.... Tetragonal 1508227
458159 P n a 21 10.3130 8.1940 4.9930 90.0 90.000 90.0 421.932 1,2,2,0,0,1,1,2,2,2,3,1,2,3,1,3,4,0,4,2,0,1,3,... 1,0,1,1,2,1,2,0,1,2,1,2,2,1,3,2,0,0,1,3,3,3,2,... 0,0,0,1,0,1,0,1,1,0,0,1,1,1,0,0,0,2,0,0,1,1,1,... 484 6.415500,5.156500,4.364200,4.263800,4.097000,3... 884.735400,207.638600,175.487800,594.721200,42... Orthorhombic 8103382

458160 rows × 16 columns

Show the code
combined.describe()
a b c alpha beta gamma volume nd id
count 458160.000000 458160.000000 458160.000000 458160.000000 458160.000000 458160.000000 458160.000000 458160.000000 4.581600e+05
mean 12.505100 13.372528 16.330272 89.503372 95.199651 90.906179 3178.613746 454.593930 4.604041e+06
std 6.874009 6.655914 8.092967 8.326557 11.649541 10.924374 6384.755145 119.026662 2.281053e+06
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000e+06
25% 8.282900 9.153000 11.279975 90.000000 90.000000 90.000000 1089.974000 498.000000 2.221300e+06
50% 10.780000 12.130000 15.169000 90.000000 92.430000 90.000000 1976.458000 499.000000 4.302406e+06
75% 14.703000 16.140000 19.860000 90.000000 101.688000 90.000000 3562.973750 500.000000 7.046806e+06
max 189.800000 150.000000 475.977800 150.172000 173.895000 149.900000 740478.188000 500.000000 9.016730e+06
Show the code
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 4, figsize=(18,7))
sns.kdeplot(combined.a, ax=ax[0][0])
sns.kdeplot(combined.b, ax=ax[0][1])
sns.kdeplot(combined.c, ax=ax[0][2])
sns.kdeplot(combined.volume, ax=ax[0][3])
sns.kdeplot(combined.alpha, ax=ax[1][0])
sns.kdeplot(combined.beta, ax=ax[1][1])
sns.kdeplot(combined.gamma, ax=ax[1][2])
sns.kdeplot(combined.nd, ax=ax[1][3])
for i in range(2):
    for j in range(3):
        ax[i][j+1].set_ylabel("")
plt.show()

We’re starting off with 458k samples, but a lot of these won’t be useful for us. I’m choosing to focus on materials with relatively small unit cells, and attempting to generate realistic looking laboratory diffraction data.

We can see that some unit cells are listed as having zero volume, and have a value of 0 for the unit cell lengths and angles. Our goal of modifying unit cells and using them to generate the diffraction data as a form of data augmentation means that these entries are not useful to us, so we’ll need to remove them. We can also see that there are some monstrous unit cells in there, with volumes of >700,000 ų!

To make things manageable for future machine learning projects, we’re going to set limits on the allowable unit cell dimensions and volumes. Let’s set the lower and upper limits for volume to 400 and 4000 ų respectively, and set the maximum unit cell length to 50 Å. We can also see some unusual looking unit cell angles, so I’m going to trim entries with cell angles below 60 or above 120 degrees (keeping 120 degrees is important to ensure we keep hexagonal unit cells).

Eventually, we’re going to generate diffraction data. I’ve chosen to focus on realistic looking laboratory diffraction data, and hence I’m choosing to work in the (arbitrarily chosen!) range 4 - 44 ° \(2\theta\) with Cu K-α1 radiation. We’ll want to ensure that there are at least a few easily detectable peaks within this chosen range; I’ll set the minimum number of such peaks to 10.

min_volume = 400
max_volume = 4000
max_cell_length = 50
min_cell_angle = 60
max_cell_angle = 120
min_data_angle = 4
max_data_angle = 44
wavelength = 1.54056
min_detectable_peaks_in_range = 10

Unit cell dimensions and volume

Let’s first filter by the unit cell restrictions on volume, lengths and angles. We can also filter by the number of d-spacings (in the column nd) though we will need to come back to this again later to check that the remaining reflections are both in the range desired and also have reasonable intensities.

Show the code
combined = combined[(combined.volume <= max_volume) & (combined.volume >= min_volume)]
combined = combined[(combined.a != 0) & (combined.b != 0) & (combined.c != 0)]
combined = combined[(combined.a <= max_cell_length) & (combined.b <= max_cell_length) & (combined.c <= max_cell_length)]
combined = combined[(combined.alpha >= min_cell_angle) & (combined.alpha <= max_cell_angle)]
combined = combined[(combined.beta >= min_cell_angle) & (combined.beta <= max_cell_angle)]
combined = combined[(combined.gamma >= min_cell_angle) & (combined.gamma <= max_cell_angle)]
combined = combined[(combined.nd >= min_detectable_peaks_in_range)]
combined.describe()
a b c alpha beta gamma volume nd id
count 324333.000000 324333.000000 324333.000000 324333.000000 324333.000000 324333.000000 324333.000000 324333.000000 3.243330e+05
mean 10.992746 12.400612 15.034673 89.719327 94.956107 90.499324 1849.638022 477.448866 4.549946e+06
std 4.372923 4.840500 5.626536 7.474114 9.913211 9.476333 931.743127 81.543710 2.265363e+06
min 2.808500 2.820000 2.718100 60.019000 60.130000 60.000000 400.059000 16.000000 1.000000e+06
25% 8.236400 9.237500 11.247500 90.000000 90.000000 90.000000 1085.857000 498.000000 2.219163e+06
50% 10.114500 11.585500 14.365300 90.000000 93.216000 90.000000 1699.288000 499.000000 4.129720e+06
75% 12.609000 14.506900 17.908600 90.000000 101.601000 90.000000 2515.215000 500.000000 7.047435e+06
max 49.953000 49.977000 49.944000 119.990000 120.000000 120.000000 3999.991000 500.000000 9.016729e+06

That’s a fairly significant reduction in the number of samples - the size of the dataset has gone down by over 100k! However, hopefully we’ve now gotten rid of most of the really odd or unusable samples in the dataset, and the remaining filtering won’t have such a drastic effect on the remaining entries.

Let’s now replot the kernel density plots we saw earlier and see how things look after the filtering.

Show the code
fig, ax = plt.subplots(2, 4, figsize=(18,7))
sns.kdeplot(combined.a, ax=ax[0][0])
sns.kdeplot(combined.b, ax=ax[0][1])
sns.kdeplot(combined.c, ax=ax[0][2])
sns.kdeplot(combined.volume, ax=ax[0][3])
sns.kdeplot(combined.alpha, ax=ax[1][0])
sns.kdeplot(combined.beta, ax=ax[1][1])
sns.kdeplot(combined.gamma, ax=ax[1][2])
sns.kdeplot(combined.nd, ax=ax[1][3])
plt.show()

Things look a bit nicer now! Just out of interest, we’ll take a quick look at the distribution of crystal systems and space groups. Other work looking at the CSD and other databases has shown a significant imbalance in the number of crystal structures adopting each crystal system and indeed space group.

Show the code
combined.type.value_counts().sort_values().iloc[::-1].head(20).plot.bar()
plt.show()

Show the code
combined.spacegroup_x.value_counts().sort_values().iloc[::-1].head(20).plot.bar()
plt.show()

Unsurprisingly, we see that some space groups are much more highly represented than others. If we decide to try crystal system, space group or extinction symbol classification using this data, we will need to keep this fairly severe class imbalance in mind!
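
As a quick, purely hypothetical illustration of one way this imbalance might be handled later (not used elsewhere in this post), inverse-frequency class weights can be computed directly from the dataframe:

# Hypothetical sketch: inverse-frequency class weights for the crystal system labels,
# normalised so that the mean weight is 1. Not used in the rest of this post.
counts = combined.type.value_counts()
class_weights = 1.0 / counts
class_weights /= class_weights.mean()
print(class_weights)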

For convenience in future work, it’d be good to convert these labels into their International Tables numbers. Let’s write a quick function to do that, getting rid of any entries that have unknown space groups, or space group labels that are otherwise difficult to interpret.

Show the code
from pymatgen.symmetry import groups
import numpy as np

symbol_to_num = {}
for s in np.unique(combined.spacegroup_x.to_numpy()):
    try:
        g = groups.SpaceGroup(s)
        symbol_to_num[s] = g.int_number
    except Exception:
        # Space group symbol couldn't be interpreted by pymatgen
        symbol_to_num[s] = -1

def label_SGs(sg):
    return symbol_to_num[sg]

combined["sg_number"] = combined.spacegroup_x.apply(label_SGs)
combined = combined[combined.sg_number > 0]
combined.sg_number.describe()
count    323694.000000
mean         24.984507
std          44.618451
min           1.000000
25%           2.000000
50%          14.000000
75%          15.000000
max         230.000000
Name: sg_number, dtype: float64

We lost <1000 entries in that step, which is good news!

Diffraction data

We can now turn our attention to the diffraction data - the intensities, Miller indices and d-spacings. Some of the entries will have a huge number of peaks, some will have very few. To start with, I’m choosing to exclude samples that have reflections at angles below the bottom of our data range. I also want to ensure that there is still at least one peak above the top end of our data range, which ensures that we aren’t missing any peaks inside our chosen data range.

Our dataframe contains the d-spacings for the observed reflections, so using Bragg’s law we can easily convert the angular range bounds into d-spacings, and then apply the filtering just described.
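
As a reminder, Bragg’s law (with \(n = 1\)) is \(\lambda = 2d\sin\theta\), so an angular bound converts to a d-spacing via \(d = \lambda / \left(2\sin(2\theta/2)\right)\); this is exactly what the small lambda functions below implement.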

Show the code
import numpy as np

d_from_tt = lambda x: wavelength/(2*np.sin(np.deg2rad(x)/2))
tt_from_d = lambda x: 2*np.rad2deg(np.arcsin(wavelength/(2*x)))
max_d = d_from_tt(min_data_angle)
min_d = d_from_tt(max_data_angle)
print("Wavelength:", wavelength)
print(f"Minimum angle = {min_data_angle}  : Maximum d-spacing = {max_d:.3f}")
print(f"Maximum angle = {max_data_angle} : Minimum d-spacing = {min_d:.3f}")
Wavelength: 1.54056
Minimum angle = 4  : Maximum d-spacing = 22.071
Maximum angle = 44 : Minimum d-spacing = 2.056
Show the code
def d_spacing_limits(x, min_d_spacing=1, max_d_spacing=100):
    # d-spacings are stored as a comma-separated string in descending order,
    # so d[0] is the largest d-spacing (lowest angle) and d[-1] the smallest
    d = list(filter(None, x.strip().split(",")))
    d = [float(i) for i in d]
    if len(d) == 0:
        return False
    # Require at least one reflection beyond the top of the angular range
    if d[-1] > min_d_spacing:
        return False
    # Exclude entries with reflections below the bottom of the angular range
    if d[0] > max_d_spacing:
        return False
    return True

combined["d_spacing_filter"] = combined.dvalue.apply(d_spacing_limits, args=(min_d, max_d))
Show the code
combined = combined[combined["d_spacing_filter"] == True]
combined.describe()
a b c alpha beta gamma volume nd id sg_number
count 294047.000000 294047.000000 294047.000000 294047.000000 294047.000000 294047.000000 294047.000000 294047.000000 2.940470e+05 294047.000000
mean 10.880480 12.250730 14.660952 89.810688 95.496771 90.697710 1761.181410 475.436522 4.502841e+06 27.194466
std 4.407073 4.930788 5.572797 6.735834 9.633053 8.870661 911.606182 85.121247 2.291774e+06 46.176416
min 2.808500 2.825500 2.718100 60.019000 60.130000 60.000000 400.059000 16.000000 1.000000e+06 1.000000
25% 8.104550 9.061050 10.951000 90.000000 90.000000 90.000000 1033.725000 498.000000 2.214880e+06 4.000000
50% 9.900700 11.252000 13.876500 90.000000 93.612000 90.000000 1587.395000 499.000000 4.122961e+06 14.000000
75% 12.467950 14.239950 17.401100 90.000000 101.996000 90.000000 2354.773000 500.000000 7.047926e+06 19.000000
max 45.366000 48.809000 49.494500 119.990000 120.000000 120.000000 3999.991000 500.000000 9.016729e+06 230.000000

We are down to just shy of 300k samples now, with one last bit of filtering to apply: ensuring that we have sufficient easily detectable peaks within our data range. To do this, we need to look at the intensities:

Show the code
combined.intensita
11        1000.000000,293.781700,0.932900,880.099700,350...
12        344.517500,136.663800,113.543400,41.743900,4.9...
13        480.841100,1000.000000,345.837700,13.887000,72...
14        6.550100,7.094000,99.801200,1000.000000,642.20...
17        2.707700,223.647400,1000.000000,119.345700,113...
                                ...                        
458148    191.292600,19.724900,41.216500,169.019100,150....
458151    1000.000000,782.379000,202.061600,29.027900,13...
458155    1000.000000,62.638200,82.940500,3.929300,0.067...
458158    435.793700,1000.000000,156.773800,74.174400,1....
458159    884.735400,207.638600,175.487800,594.721200,42...
Name: intensita, Length: 294047, dtype: object

Intensities are stored as a string of comma separated values, with each pattern scaled to a maximum intensity of 1000. Next we should ensure that in our chosen data range, the intensities that are present are actually strong enough to be observed. Let’s say we want the minimum detectable intensity to be 0.5 % of the maximum intensity; as the maximum is 1000, there should therefore be at least \(N\) intensities in our range with a value greater than 5. As discussed earlier, I’ve chosen to set \(N = 10\).

Show the code
def n_peaks_in_range(dspacing, intensity, min_intensity=5):
    dspacing = list(filter(None, dspacing.strip().split(",")))
    dspacing = np.array([float(i) for i in dspacing])
    intensity = list(filter(None, intensity.strip().split(",")))
    intensity = np.array([float(i) for i in intensity])
    intensity = intensity[dspacing >= min_d]
    return (intensity > min_intensity).sum()

combined["n_detectable_peaks_in_range"] = combined.apply(lambda x: n_peaks_in_range(x.dvalue, x.intensita), axis=1)
combined["n_detectable_peaks_in_range"].describe()
count    294047.000000
mean        113.481430
std          59.305561
min           0.000000
25%          72.000000
50%         107.000000
75%         149.000000
max         386.000000
Name: n_detectable_peaks_in_range, dtype: float64

As we can see, some of the patterns we might generate would have very few or no easily detectable peaks in our chosen data range. As our final filtering step, we’ll get rid of them, leaving us with our final filtered dataset.

Show the code
combined = combined[combined.n_detectable_peaks_in_range >= min_detectable_peaks_in_range]
combined.describe()
a b c alpha beta gamma volume nd id sg_number n_detectable_peaks_in_range
count 288123.000000 288123.000000 288123.000000 288123.000000 288123.000000 288123.000000 288123.000000 288123.000000 2.881230e+05 288123.000000 288123.000000
mean 10.915994 12.314144 14.769507 89.806797 95.609645 90.698470 1780.427861 484.062508 4.493060e+06 23.272384 115.697018
std 4.439603 4.955456 5.568620 6.804673 9.698929 8.939179 909.067811 60.494039 2.259233e+06 37.387276 57.842270
min 2.808500 2.825500 2.760000 60.019000 60.130000 60.000000 400.059000 30.000000 1.000000e+06 1.000000 10.000000
25% 8.099250 9.133100 11.122850 90.000000 90.000000 90.000000 1054.453000 498.000000 2.216278e+06 4.000000 75.000000
50% 9.925000 11.332000 13.984900 90.000000 93.924000 90.000000 1608.534000 499.000000 4.122969e+06 14.000000 108.000000
75% 12.542100 14.337000 17.492400 90.000000 102.190000 90.000000 2378.520500 500.000000 7.045938e+06 15.000000 151.000000
max 45.366000 48.809000 49.494500 119.990000 120.000000 120.000000 3999.991000 500.000000 9.016729e+06 230.000000 386.000000

Saving the filtered data

Now that we’ve filtered our data to our requirements, the last thing to do is save the relevant information in numpy arrays which can then be quickly and easily loaded from disk when needed.

The arrays to be saved will contain:

  • Unit cells
  • Crystal systems
  • Miller indices
  • Peak intensities
  • Space groups
  • COD IDs in case we want to look up a particular crystal structure

As the number of peaks in each pattern is different, the Miller index and peak intensity arrays will need padding up to the maximum number of peaks, which is 500.

Show the code
def get_array(x, dtype=np.float32):
    data = np.zeros(500, dtype=dtype)
    x = list(filter(None, x.strip().split(",")))
    x = np.array(x, dtype=dtype)
    data[:len(x)] = x
    return data

h = combined.h.apply(get_array, args=(np.int64,))
k = combined.k.apply(get_array, args=(np.int64,))
l = combined.l.apply(get_array, args=(np.int64,))
intensities = combined.intensita.apply(get_array, args=(np.float32,))

intensities = np.vstack(intensities)
h = np.vstack(h)
k = np.vstack(k)
l = np.vstack(l)
hkl = np.dstack([h,k,l])

unit_cell = combined[["a","b","c","alpha","beta","gamma"]].to_numpy()
crystal_systems, crystal_systems_numeric = np.unique(combined.type.to_numpy(), return_inverse=True)
space_group_number = combined.sg_number.to_numpy()
cod_id = combined.id.to_numpy()
print(crystal_systems)

Now we can save everything, as well as a configuration file to keep track of our settings, which we’ll save using the json format.

Show the code
import json

base_name = f"{min_data_angle}-{max_data_angle}-CuKa1-data_{max_volume}_"
config = {}
config["base_name"] = base_name
config["min_volume"] = min_volume
config["max_volume"] = max_volume
config["max_cell_length"] = max_cell_length
config["min_cell_angle"] = min_cell_angle
config["max_cell_angle"] = max_cell_angle
config["min_data_angle"] = min_data_angle
config["max_data_angle"] = max_data_angle
config["wavelength"] = wavelength
config["min_detectable_peaks_in_range"] = min_detectable_peaks_in_range
config["crystal_systems"] = [x[0]+" = "+str(x[1]) for x in zip(crystal_systems, np.arange(len(crystal_systems)))]

with open("data_config.json", "w") as f:
    json.dump(config, f, indent=4)
np.save(base_name+"unit_cell.npy", unit_cell)
np.save(base_name+"crystal_systems_numeric.npy", crystal_systems_numeric)
np.save(base_name+"space_group_number.npy", space_group_number)
np.save(base_name+"hkl.npy", hkl)
np.save(base_name+"intensities.npy", intensities)
np.save(base_name+"cod_id.npy", cod_id)
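
When we later come to train models, these arrays can simply be loaded straight back from disk. A minimal sketch (assuming the same base_name and working directory as above):

# Reload the saved arrays (assumes the same base_name and working directory)
unit_cell = np.load(base_name + "unit_cell.npy")
crystal_systems_numeric = np.load(base_name + "crystal_systems_numeric.npy")
space_group_number = np.load(base_name + "space_group_number.npy")
hkl = np.load(base_name + "hkl.npy")
intensities = np.load(base_name + "intensities.npy")
cod_id = np.load(base_name + "cod_id.npy")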

Generating synthetic data

Now that we’ve filtered down the original PowCod database into the entries that meet the (arbitrary) conditions described earlier, it’s time to start thinking about how we can use the information to generate synthetic PXRD data.

I’ve split this task into:

  • Peak positions
  • Peak intensities
  • Peak shapes
  • Noise and background

Peak positions

Rather than using the d-spacings from the database directly, we are going to make use of the unit cells and Miller indices to calculate the d-spacings (and hence angles) for the reflections. This will allow us to add a small amount of noise to the unit cell dimensions, which serves as a form of data augmentation.

The d-spacing of a given reflection can be calculated using this equation:

\(\frac{1}{d^{2}} = \mathbf{h}^{\mathsf{T}}\mathbf{G}^{*}\mathbf{h} = X_{hh}h^2 + X_{kk}k^2 + X_{ll}l^2 + X_{hk}hk + X_{hl}hl + X_{kl}kl\)

where \(\mathbf{h} = (h, k, l)\) and \(\mathbf{G}^{*}\) is the reciprocal space metric tensor: the diagonal coefficients \(X_{hh}\), \(X_{kk}\) and \(X_{ll}\) are its diagonal elements, whilst the cross-term coefficients \(X_{hk}\), \(X_{hl}\) and \(X_{kl}\) are twice its off-diagonal elements.

This means that we will need a function to convert the unit cell dimensions into a matrix representation of the unit cell, invert this to determine the reciprocal lattice matrix, and from this obtain the reciprocal lattice metric tensor. Once we have the reciprocal lattice metric tensor, we can calculate the d-spacings quite easily by multiplying with the Miller index array(s). However, one complication is that we want to do this for several samples simultaneously, so we need to consider the shapes of our arrays.
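
Before writing the batched PyTorch versions, here is a minimal, non-batched sketch of the linear algebra for a single cell and reflection (the simple_d_spacing helper is purely illustrative and isn’t used later):

import numpy as np

def simple_d_spacing(cell, hkl, wavelength=1.54056):
    # cell = (a, b, c, alpha, beta, gamma); lengths in angstroms, angles in degrees
    a, b, c = cell[:3]
    al, be, ga = np.deg2rad(cell[3:])
    # Direct lattice vectors as the rows of a matrix
    v_a = [a, 0.0, 0.0]
    v_b = [b*np.cos(ga), b*np.sin(ga), 0.0]
    cx = c*np.cos(be)
    cy = c*(np.cos(al) - np.cos(be)*np.cos(ga))/np.sin(ga)
    cz = np.sqrt(c**2 - cx**2 - cy**2)
    M = np.array([v_a, v_b, [cx, cy, cz]])
    # Direct metric tensor G = M M^T; reciprocal metric tensor G* = G^-1
    G_star = np.linalg.inv(M @ M.T)
    one_over_d_squared = hkl @ G_star @ hkl
    d = 1.0/np.sqrt(one_over_d_squared)
    twotheta = 2*np.rad2deg(np.arcsin(wavelength/(2*d)))
    return d, twotheta

# (1,1,0) reflection of a 5 angstrom cubic cell: d = 5/sqrt(2) = 3.536, 2theta = 25.2
print(simple_d_spacing(np.array([5.0, 5.0, 5.0, 90.0, 90.0, 90.0]), np.array([1, 1, 0])))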

Before dealing with that, let’s write a function to simulate the effect of temperature changes by adding a small amount of Gaussian noise to the unit cell dimensions. We will have to consider the crystal system symmetry here in order to ensure that, for example for a cubic unit cell, the perturbation to lengths \(b\) and \(c\) is equal to the perturbation to \(a\). The standard deviation of the random noise to be added is going to be set to a default value of 0.05, meaning that almost all perturbed unit cells should have edge lengths within +/- 0.15 Å of the database values, and angles within +/- 0.15 ° (assuming they are allowed to vary by symmetry). This seems reasonable, and should allow us to assume that the fractional coordinates within the unit cells are unaffected, so we don’t need to worry about the effect on the relative intensities.

Show the code
import torch

def get_unit_cell_perturbation(crystal_systems, dtype=torch.float32, stddev=0.05):
    """Generate a perturbation for the unit cell lengths and angles for a given
    crystal system.

    Args:
        crystal_systems (tensor):   Crystal systems for the unit cells.
                                    Shape = (batch)
        dtype (torch.dtype, optional): Defaults to torch.float32.
        stddev (float, optional):   Standard deviation for the gaussian noise.
                                    Defaults to 0.05.

    Returns:
        tensor: A tensor to additively adjust the unit cell lengths and angles
                Shape = (batch, 6)
    """
    batchsize = crystal_systems.shape[0]
    device = crystal_systems.device
    lengths, angles = torch.randn((2, batchsize, 3), device=device, dtype=dtype) * stddev
    cubic = crystal_systems == 0
    hexagonal = crystal_systems == 1
    monoclinic = crystal_systems == 2
    #orthorhombic = crystal_systems == 3 # Don't need to query data for this
    tetragonal = crystal_systems == 4
    triclinic = crystal_systems == 5
    trigonal_h = crystal_systems == 6
    trigonal_r = crystal_systems == 7

    # Cubic, tetragonal, rhombohedral, trigonal (hex. axes) and hexagonal - a and b must be the same
    lengths[:,1] = torch.where(cubic | hexagonal | tetragonal | trigonal_h | trigonal_r,
                            lengths[:,0], lengths[:,1])
    # Cubic and rhombohedral cells - a, b and c must be the same
    lengths[:,2] = torch.where(cubic | trigonal_r, lengths[:,0], lengths[:,2])
    # Rhombohedral and triclinic cells - could change their alpha values
    angles[:,0] = torch.where((trigonal_r | triclinic), angles[:,0], 0.)
    # Triclinic or monoclinic cells could change beta values
    angles[:,1] = torch.where((triclinic | monoclinic), angles[:,1], 0.)
    # Triclinic cells could change gamma
    angles[:,2] = torch.where(triclinic, angles[:,2], 0.)
    # Rhombohedral cells - need to ensure all angles are the same
    angles[:,1] = torch.where(trigonal_r, angles[:,0], angles[:,1])
    angles[:,2] = torch.where(trigonal_r, angles[:,0], angles[:,2])

    return torch.concat([lengths, angles], dim=-1)

Let’s check it does what it’s meant to!

What we’re expecting is that for the unit cell lengths, the perturbation should be the same for each of the edges for cubic and rhombohedral cells, and the perturbation for \(a\) and \(b\) should be the same for hexagonal, trigonal (hexagonal axes) and tetragonal cells.

With regard to the angles, where the original unit cell angle is fixed at 90 or 120 degrees, the perturbation should be zero. So we should see a change in the beta value for monoclinic cells, a change in all three angles for triclinic cells, and the same change in each of the angles for rhombohedral cells.

Show the code
cs_key =  {
    "Cubic" : 0,
    "Hexagonal" :  1,
    "Monoclinic" : 2,
    "Orthorhombic" : 3,
    "Tetragonal" : 4,
    "Triclinic" : 5,
    "Trigonal (hexagonal axes)" : 6,
    "Trigonal (rhombohedral axes)" : 7
}

for c in cs_key.items():
    name, number = c
    perturb = get_unit_cell_perturbation(torch.tensor([[int(number)]]))
    lengths = perturb[:,:3]
    angles = perturb[:,3:]
    print(name, "Length perturbation:", lengths)
    print(name, "Angle perturbation:", angles,"\n")
Cubic Length perturbation: tensor([[0.0120, 0.0120, 0.0120]])
Cubic Angle perturbation: tensor([[0., 0., 0.]]) 

Hexagonal Length perturbation: tensor([[0.0535, 0.0535, 0.0225]])
Hexagonal Angle perturbation: tensor([[0., 0., 0.]]) 

Monoclinic Length perturbation: tensor([[-0.0509,  0.0907, -0.0980]])
Monoclinic Angle perturbation: tensor([[0.0000, 0.0176, 0.0000]]) 

Orthorhombic Length perturbation: tensor([[-0.0788, -0.0403, -0.0860]])
Orthorhombic Angle perturbation: tensor([[0., 0., 0.]]) 

Tetragonal Length perturbation: tensor([[-0.0156, -0.0156, -0.0144]])
Tetragonal Angle perturbation: tensor([[0., 0., 0.]]) 

Triclinic Length perturbation: tensor([[-0.0154, -0.0387,  0.0159]])
Triclinic Angle perturbation: tensor([[ 0.0273, -0.0582, -0.0836]]) 

Trigonal (hexagonal axes) Length perturbation: tensor([[-0.0063, -0.0063,  0.0845]])
Trigonal (hexagonal axes) Angle perturbation: tensor([[0., 0., 0.]]) 

Trigonal (rhombohedral axes) Length perturbation: tensor([[-0.0002, -0.0002, -0.0002]])
Trigonal (rhombohedral axes) Angle perturbation: tensor([[-0.0607, -0.0607, -0.0607]]) 

Looks good to me. We now need to start writing functions to generate our lattice matrix, reciprocal lattice matrix and reciprocal lattice metric tensor. We can then use these in conjunction with the Miller indices to determine the reflection d-spacings.

Given that we are modifying unit cells, it’s probably sensible to check that the perturbed unit cells are valid. We can do that by calculating the volume of the unit cells, with invalid unit cells returning a volume of zero.

Show the code
def get_lattice_matrix(unit_cell_dimensions):
    """calculate a lattice matrix from unit cell dimensions

    Args:
        unit_cell_dimensions (tensor):  The unit cell dimensions. Lengths in
                                        angstroms, angles in degrees.
                                        Shape = (batch, 6)

    Returns:
        tensor: The matrix representation of the unit cells.
                Shape = (batch, 3, 3)
    """
    pi_over_180=0.017453292519943295
    a, b, c = unit_cell_dimensions[:,:3].T
    cosal, cosbe, cosga = torch.cos(unit_cell_dimensions[:,3:]*pi_over_180).T
    sinal, sinbe = torch.sin(unit_cell_dimensions[:,3:-1]*pi_over_180).T
    # Sometimes rounding errors cause |values| slightly > 1.
    val = torch.clamp((cosal * cosbe - cosga) / (sinal * sinbe), min=-1, max=1)

    gamma_star = torch.arccos(val)
    zeros = torch.zeros_like(a)
    v_a = torch.stack([a * sinbe, zeros, a*cosbe]).T
    v_b = torch.stack([-b * sinal * torch.cos(gamma_star),
                    b*sinal * torch.sin(gamma_star),
                    b*cosal]).T
    v_c = torch.stack([zeros, zeros, c]).T

    matrix = torch.stack([v_a,v_b,v_c], dim=2)

    # Unit cells are valid if cell volume > 0
    # The cell volume is |det(M)|, but don't need the absolute value here
    volume = torch.linalg.det(matrix)
    valid = volume != 0

    return matrix, valid

Let’s quickly check that this works as expected. I’ll give it a simple cubic unit cell, and an invalid unit cell.

Show the code
cubic_cell = torch.tensor([5.0, 5.0, 5.0, 90.0, 90.0, 90.0]).unsqueeze(0)
invalid_cell = torch.tensor([5.0, 6.0, 7.0, 120.0, 120.0, 120.0]).unsqueeze(0)
print("Cubic cell valid -",get_lattice_matrix(cubic_cell)[1].item())
print("Invalid cell valid -", get_lattice_matrix(invalid_cell)[1].item())
Cubic cell valid - True
Invalid cell valid - False

The last couple of functions we need to implement include a way to get the reciprocal lattice metric tensor; the reciprocal lattice matrix itself is easily obtained by inverting the lattice matrix we calculated previously.

We also need a way to obtain the reflection d-spacings, and then lastly convert the d-spacings to \(2\theta\).

Show the code
def get_recip_lattice_metric_tensor(recip_lattice_matrix):
    """Calculate the reciprocal lattice metric tensor

    Args:
        recip_lattice_matrix (tensor):  Reciprocal lattice matrix
                                        Shape = (batch, 3, 3)

    Returns:
        tensor: Reciprocal lattice metric tensor
                Shape = (batch, 3, 3)
    """
    return recip_lattice_matrix @ recip_lattice_matrix.permute(0,2,1)


def get_d_spacing(recip_latt_metric_tensor,hkl):
    """Calculate the d-spacings for the reflections from the Miller indices and
    the reciprocal lattice metric tensor

    Args:
        recip_latt_metric_tensor (tensor):  Reciprocal lattice metric tensor
                                            Shape = (batch, 3, 3)
        hkl (tensor):   Miller indices
                        Shape = (batch, number of reflections, 3)

    Returns:
        tensor: d-spacing for each of the reflections
                Shape = (batch, number of reflections)
    """
    one_over_d_squared = torch.einsum("bij,bji->bi",hkl,torch.einsum(
                                "bij,bkj->bik",recip_latt_metric_tensor,hkl))
    d = 1/torch.sqrt(one_over_d_squared)
    return d


def d_to_tt(d,wavelength=1.54056):
    """Convert d-spacings to twotheta values (in degrees)

    Args:
        d (tensor): d-spacings for each of the reflections
                    Shape = (batch, number of reflections)
        wavelength (float): The wavelength of the radiation. Defaults to
                            1.54056 which is copper K-a1

    Returns:
        tensor: twotheta values for each of the reflections
                Shape = (batch, number of reflections)
    """
    two_times_180_over_pi = 114.59155902616465
    tt = two_times_180_over_pi*torch.arcsin(wavelength/(2*d))
    return tt

def get_zero_point_error(batchsize, device, dtype, zpemin=0.03, zpemax=0.03):
    """
    Generate a random zero-point error to be applied to the peak positions

    Args:
        batchsize (int): Size of the batch dimension
        device (torch.device): Device to generate the tensor on
        dtype (torch.dtype): dtype to use for the ZPEs
        zpemin (float, optional): Lower bound for zero point error in degrees
        zpemax (float, optional): Upper bound for zero point error in degrees

    Returns:
        tensor: Zero point error to be applied to the peak positions
                Shape = (batch, 1)
    """
    zero_point_error = (torch.rand((batchsize,1), device=device, dtype=dtype)
                        * (zpemax - zpemin)) + zpemin
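    # Note: with the default zpemin == zpemax == 0.03, this returns a constant
    # +0.03 degree shift for every sample; set zpemin to e.g. -0.03 to sample a
    # symmetric range of zero point errors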
    return zero_point_error

To test the whole set of functions written so far, I’ll add a convenience function to calculate the peak positions from the unit cells and Miller indices.

Show the code
def get_peak_positions(unit_cells, hkl, zpe=True, wavelength=1.54056):
    lattice_matrix, valid = get_lattice_matrix(unit_cells)
    unit_cells = unit_cells[valid]
    hkl = hkl[valid]
    reciprocal_lattice_matrix = torch.linalg.inv(lattice_matrix)
    reciprocal_lattice_metric_tensor = get_recip_lattice_metric_tensor(reciprocal_lattice_matrix)
    d_spacing = get_d_spacing(reciprocal_lattice_metric_tensor, hkl)
    twotheta = d_to_tt(d_spacing, wavelength)
    if zpe:
        zero_point_error = get_zero_point_error(twotheta.shape[0],
                                                twotheta.device,
                                                twotheta.dtype)
        twotheta += zero_point_error
    return twotheta

I’ll test that using some diffraction data for a Carbamazepine:Indomethacin cocrystal I published a while ago. The first 3 peaks have the following Miller indices and positions:

| \(h\) | \(k\) | \(l\) | \(2\theta\) |
|-------|-------|-------|-------------|
| 1     | 1     | 0     | 5.322       |
| 2     | 0     | 0     | 7.496       |
| 0     | 2     | 0     | 7.562       |

And the unit cell is:

\(a\) = 23.573194, \(b\) = 23.363375, \(c\) = 5.125218

\(\alpha\) = 90.000, \(\beta\) = 88.77132, \(\gamma\) = 90.000

We’ll run the function both with and without zero point errors applied.

Show the code
coXtal_unit_cell = torch.tensor([23.573194, 23.363375, 5.125218, 90.000, 88.77132, 90.000])
coXtal_hkl = torch.tensor([[1,1,0],[2,0,0],[0,2,0]]).float()

for zpe in [False, True]:
    print("ZPE:",zpe)
    coXtal_twotheta = get_peak_positions(coXtal_unit_cell.reshape(1,6), coXtal_hkl.reshape(1,3,3), zpe=zpe)
    print("[h, k, l]  2θ")
    print("----------------")
    for i, ht in enumerate(zip(coXtal_hkl, coXtal_twotheta.squeeze())):
        print(ht[0].int().tolist(), f"{ht[1].item():.3f}")
    print("")
ZPE: False
[h, k, l]  2θ
----------------
[1, 1, 0] 5.322
[2, 0, 0] 7.496
[0, 2, 0] 7.562

ZPE: True
[h, k, l]  2θ
----------------
[1, 1, 0] 5.352
[2, 0, 0] 7.526
[0, 2, 0] 7.592

That looks like it’s working nicely - without a zero point error, the peak positions match, whilst with an error, the peak positions are all shifted by a consistent amount.

Peak intensities

The main contribution to the intensity of a diffraction peak comes from the positions of the atoms within the unit cell. What makes the PowCod database so helpful is that this has already been calculated, and the result can be read in from the intensities array we saved earlier.

However, there are experimental/physical effects that can modulate these intensities. Most important is the effect of preferred orientation. There are several ways to model this; I’m choosing to implement the March-Dollase method as I already have some code written to do this in GALLOP. This code is actually based on code from GSAS-II; see the GetPrefOri function in the file found here. I updated it to use PyTorch, and to work over a batch of multiple PXRD patterns.

Again, this is being implemented as a form of data augmentation, allowing our existing data to generate multiple different, but plausible, diffraction patterns. As such, the preferred orientation axis is limited to being in the [1,0,0], [0,1,0] or [0,0,1] directions.

Future work could extend this to use the more sophisticated spherical harmonic approach described here.
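
For reference, the March-Dollase correction multiplies the intensity of each reflection by

\(W(\alpha) = \left(r^{2}\cos^{2}\alpha + \frac{\sin^{2}\alpha}{r}\right)^{-3/2}\)

where \(\alpha\) is the angle between the reflection’s scattering vector and the preferred orientation axis, and \(r\) is the March-Dollase factor (\(r = 1\) corresponds to no preferred orientation). This is what the apply_MD_PO_correction function below evaluates, with the factors drawn from a normal distribution centred on 1.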

Show the code
def get_MD_PO_components(hkl, reciprocal_lattice_metric_tensor, dspacing, factor_std=0.1):
    """Calculate the terms needed in the March-Dollase preferred orientation
    correction.

    Args:
        hkl (tensor):   The Miller indices of the reflections.
                        Shape = (batch, number of reflections, 3)
        reciprocal_lattice_metric_tensor (tensor):  The reciprocal lattice
                                            metric tensor
                                            Shape = (batch, 3, 3)
        dspacing (tensor):  The d-spacings of the reflections
                            Shape = (batch, number of reflections)
        factor_std (float, optional):   The standard deviation for the normally
                                        distributed March-Dollase factors.
                                        Defaults to 0.1.

    Returns:
        tuple: Tuple of tensors containing the terms needed in the March-Dollase
        preferred orientation correction function. cosP, sinP = cosine and sine
        of the angle between the Miller indices and the PO axis. Factor = the
        March Dollase factors, PO_axis = the preferred orientation axis.
    """
    batchsize = hkl.shape[0]
    device = hkl.device
    dtype = hkl.dtype
    # Randomly assign the PO axis to be either [1,0,0], [0,1,0] or [0,0,1]
    PO_axis = torch.zeros((batchsize, 3), device=device, dtype=dtype)
    PO_axis_select = torch.randint(0,3,(batchsize,),device=device)
    PO_axis[torch.arange(batchsize, device=device),PO_axis_select] = 1.0


    """u = hkl / torch.sqrt(torch.einsum("bkj,bkj->bk", hkl,
                        torch.einsum("bij,bkj->bki",
                        reciprocal_lattice_matrix,hkl))).unsqueeze(2)"""
    # Dividing the Miller indices by the reciprocal lattice vector lengths is
    # equivalent to multiplying them by the d-spacings
    u = hkl * dspacing.unsqueeze(2)

    cosP = torch.einsum("bij,bj->bi", u, torch.einsum("bij,bj->bi",
                    reciprocal_lattice_metric_tensor, PO_axis))
    one_minus_cosP_sqd = 1.0-cosP**2
    one_minus_cosP_sqd[one_minus_cosP_sqd < 0.] *= 0.
    sinP = torch.sqrt(one_minus_cosP_sqd)

    # MD factor = 1 means no PO. Use a normal distribution with std given in the
    # argument centred at 1.
    factor = 1 + (torch.randn((batchsize,1),device=device,dtype=dtype) * factor_std)

    return cosP, sinP, factor, PO_axis

def apply_MD_PO_correction(intensities, cosP, sinP, factor):
    """Modifies the intensities to account for preferred orientation effects
    using the method of March and Dollase.

    Args:
        intensities (tensor):   Original intensities for the reflections.
            Shape = (batch, number of reflections)
        cosP (tensor):  Cosine of the angle between the Miller indices and the
            preferred orientation axis. Calculated in get_MD_PO_components
            Shape = (batch, number of reflections)
        sinP (tensor):  Sine of the angle between the Miller indices and the
            preferred orientation axis. Calculated in get_MD_PO_components
            Shape = (batch, number of reflections)
        factor (tensor): The March-Dollase factors. Shape = (batch,1)

    Returns:
        tensor: Modified intensities for the reflections given preferred
        orientation. Shape = (batch, number of reflections)
    """
    A_all = (1.0/torch.sqrt(((factor)*cosP)**2+sinP**2/(factor)))**3
    return intensities * A_all

Peak shapes

Now that we can calculate where the peaks will be, and modulate the intensity of the peaks if we want to, we now need to think about the shape of the peaks.

There are several possible peak shapes that we could go for, however, in order to try to get relatively realistic results, I decided to implement full Voigt functions, and also include the Finger, Cox and Jephcoat correction for peak asymmetry due to axial divergence. I also decided to implement peak broadening as a function of the peak position, using the well established U,V,W,X,Y,Z parameters described here and elsewhere.
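
Concretely, in the functions below the Gaussian and Lorentzian half-widths for a peak at position \(2\theta\) are calculated as \(\mathrm{HWHM}_{G} = \sqrt{U\tan^{2}(2\theta) + V\tan(2\theta) + W + Z/\cos^{2}(2\theta)}\) and \(\mathrm{HWHM}_{L} = X\tan(2\theta) + Y/\cos(2\theta)\), with the parameters drawn at random from ranges chosen to give data resembling laboratory diffraction patterns.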

We therefore need to calculate Gaussian, Lorentzian and FCJ profiles for each peak in the data, then convolve them together. Using the convolution theorem, we can do this by first Fourier transforming them, calculating their pointwise product in Fourier space, then inverse Fourier transforming the resultant product.
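
As a minimal standalone sketch of this convolution trick (with arbitrary, illustrative peak widths rather than the full machinery used below):

import torch

# Convolve a Gaussian and a Lorentzian (both centred at zero) via the FFT to
# obtain a Voigt-like profile
x = torch.linspace(-1.5, 1.5, 2048)
sig, gam = 0.02, 0.03
peak_G = torch.exp(-0.5 * x**2 / sig**2)
peak_L = gam / (x**2 + gam**2)
voigt = torch.fft.ifft(torch.fft.fft(peak_G) * torch.fft.fft(peak_L)).real
voigt = torch.fft.ifftshift(voigt)  # undo the circular shift from the FFT-based convolution
voigt /= voigt.max()                # rescale to unit height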

Show the code
def gaussian(x, mu, sig):
    """Calculate a gaussian peak

    Args:
        x (tensor): the x-coordinates for the peak. Shape = (datapoints)
        mu (tensor): the mean of the gaussian.
            Shape = (batch, number of peaks, 1)
        sig (tensor): the standard deviation of the gaussian.
            Shape = (batch, number of peaks, 1)

    Returns:
        tensor: the gaussian peaks centered at mu with standard deviation sig.
            Intensities scaled to max = 1.
            Shape = (batch, number of peaks, datapoints)
    """
    root_two_pi = 2.5066282746310002
    peak = (1/(sig*root_two_pi))*torch.exp(-0.5*(x-mu)**2/sig**2)
    return peak/peak.max(dim=-1).values.unsqueeze(2)


def lorentzian(x, loc, gam):
    """Calculate a lorentzian peak

    Args:
        x (tensor): the x-coordinates for the peak. Shape = (datapoints)
        loc (tensor): The centre of the peaks.
            Shape = (batch, number of peaks, 1)
        gam (tensor): The half width at half max for the peaks.
             Shape = (batch, number of peaks, 1)

    Returns:
        tensor: the lorentzian peaks centered at loc with HWHM gam.
            Intensities scaled to max = 1.
            Shape = (batch, number of peaks, datapoints)
    """
    one_over_pi = 0.3183098861837907
    peak = one_over_pi*(gam/((x - loc)**2+gam**2))
    return peak/peak.max(dim=-1).values.unsqueeze(2)


def fcj(data,twotheta,shl):
    """Finger, Cox and Jephcoat profile function.
    Code based on GSAS-II function found here:
    https://gsas-ii.readthedocs.io/en/latest/_modules/GSASIIpwd.html#fcjde_gen

    Args:
        data (tensor): the data to evaluate the function on.
            Shape = (batch, number of peaks, datapoints)
        twotheta (tensor): The position of the peak.
            Shape = (batch, number of peaks, 1)
        shl (tensor): shl = sum(S/L,H/L) where:
            S: sample height
            H: detector opening
            L: sample to detector opening distance
            Shape = (batch, 1)

    Returns:
        tensor: The Finger, Cox and Jephcoat profiles for the peaks
            Shape = (batch, number of peaks, datapoints)
    """
    pi_over_180 = 0.017453292519943295
    step = data[1] - data[0]
    T = step*data+twotheta
    abs_cos_T = torch.abs(torch.cos(T*pi_over_180))
    abs_cos_T_sqd = abs_cos_T**2
    cos_sqd_twotheta = torch.cos(twotheta*pi_over_180)**2
    cos_sqd_twotheta = torch.where(abs_cos_T_sqd>cos_sqd_twotheta,
                                    cos_sqd_twotheta,abs_cos_T_sqd)
    fcj_profile = torch.where(abs_cos_T_sqd>cos_sqd_twotheta,
                    (torch.sqrt(cos_sqd_twotheta/(abs_cos_T_sqd-cos_sqd_twotheta+1e-9))
                    -1./shl)/abs_cos_T,0.0)
    fcj_profile = torch.where(fcj_profile > 0.,fcj_profile,0.)
    # Sometimes the FCJ profile returned is all zeros. We don't want to separate
    # these out with loops etc, so need to deal with them. We'll replace each zero
    # entry with an array like this ([1, 0, 0, ..., 0]) which gives ([1, 1, 1, ..., 1])
    # after the Fourier transform. This then means that the pointwise multiplication
    # of the FT Gaussian and Lorentzian components can still occur unaffected
    zero_sum = (fcj_profile.sum(dim=-1) == 0).type(fcj_profile.dtype)
    fcj_profile[:,:,0] += zero_sum
    return fcj_profile

def get_UVWZ(batchsize, device, dtype, U_min=0.0001, U_max=0.0004,
            V_min=0.0001, V_max=0.0004, W_min=0.0001, W_max=0.0004,
            Z_min=0.0001, Z_max=0.0004):
    """
    Get parameters for Gaussian HWHM. Defaults should give reasonable data to
    resemble laboratory diffraction data
    """
    U = ((torch.rand(batchsize, device=device, dtype=dtype) * (U_max-U_min))
            + U_min).unsqueeze(1)

    V = ((torch.rand(batchsize, device=device, dtype=dtype) * (V_max-V_min))
            + V_min).unsqueeze(1)

    W = ((torch.rand(batchsize, device=device, dtype=dtype) * (W_max-W_min))
            + W_min).unsqueeze(1)

    Z = ((torch.rand(batchsize, device=device, dtype=dtype) * (Z_max-Z_min))
            + Z_min).unsqueeze(1)

    return U, V, W, Z

def get_XY(batchsize, device, dtype, X_min=0.001, X_max=0.035,
            Y_min=0.001,Y_max=0.035):
    """
    Get parameters for Lorentzian HWHM. Defaults should give reasonable data to
    resemble laboratory diffraction data
    """
    X = ((torch.rand(batchsize, device=device, dtype=dtype) * (X_max-X_min))
            + X_min).unsqueeze(1)

    Y = ((torch.rand(batchsize, device=device, dtype=dtype) * (Y_max-Y_min))
            + Y_min).unsqueeze(1)

    return X, Y

def get_hwhm_G(tan_twotheta, cos_twotheta, U, V, W, Z):
    """Calculate Gaussian HWHM as a function of peak position and U,V,W and Z params

    Args:
        tan_twotheta (tensor): tangent of the twotheta peak positions
        cos_twotheta (tensor): cosine of the twotheta peak positions
        U (tensor): peakshape parameter U
        V (tensor): peakshape parameter V
        W (tensor): peakshape parameter W
        Z (tensor): peakshape parameter Z

    Returns:
        tensor: HWHM for the gaussian peaks
            Shape = (batch, number of peaks)
    """
    tan_twotheta = tan_twotheta.squeeze()
    cos_twotheta = cos_twotheta.squeeze()
    return torch.sqrt((U * tan_twotheta**2) + (V * tan_twotheta)
                        + W + (Z/(cos_twotheta**2))).unsqueeze(2)

def get_hwhm_L(tan_twotheta, cos_twotheta, X, Y):
    """Calculate Lorentzian HWHM as a function of peak position and X and Y params

    Args:
        tan_twotheta (tensor): tangent of the twotheta peak positions
        cos_twotheta (tensor): cosine of the twotheta peak positions
        X (tensor): peakshape parameter X
        Y (tensor): peakshape parameter Y

    Returns:
        tensor: HWHM for the lorentzian peaks
            Shape = (batch, number of peaks)
    """
    tan_twotheta = tan_twotheta.squeeze()
    cos_twotheta = cos_twotheta.squeeze()
    return ((X * tan_twotheta) + (Y/cos_twotheta)).unsqueeze(2)

def get_shl(batchsize, device, dtype, shlmax=0.5, rescale=True):
    """Generate asymmetry parameter for the FCJ profile.

    Args:
        batchsize (int): Size of the batch dimension
        device (torch.device): Device to generate the tensor on
        dtype (torch.dtype): dtype to use for shl
        shlmax (float, optional): Maximum value for shl = sum(S/L,H/L) where:
            S: sample height
            H: detector opening
            L: sample to detector opening distance
        rescale (bool, optional): If True, scale shl by 1/57.2958 in keeping
            with the scaling applied in the original GSAS-II code

    Returns:
        tensor: The asymmetry parameter for the FCJ profiles
                Shape = (batch, 1, 1)
    """
    shl = torch.rand((batchsize, 1, 1), device=device, dtype=dtype) * shlmax
    if rescale:
        shl /= 57.2958
    return shl

Background and noise

The experimental noise can be modelled as Gaussian noise, which we can add very easily to the calculated diffraction data. For the background shape, we can use Chebyshev polynomials.
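
The degree-\(n\) Chebyshev polynomial of the first kind can be evaluated as \(T_{n}(x) = \cos(n\arccos x)\) for \(x \in [-1, 1]\), which is how the background basis functions are constructed in the code below (after rescaling the \(2\theta\) values onto \([-1, 1]\)); the backgrounds are then random weighted sums of these basis functions.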

Show the code
def get_noise(calculated_patterns, noise_min = 0.0001, noise_max = 0.0025):
    """Get noise for the diffraction patterns to simulate experimental data

    Args:
        calculated_patterns (tensor): The diffraction patterns.
            Shape = (batch, points in profile)
        noise_min (float, optional): Minimum standard deviation for the
            Gaussian noise. Defaults to 0.0001.
        noise_max (float, optional): Maximum standard deviation for the
            Gaussian noise. Defaults to 0.0025.

    Returns:
        tensor: Noise to be added to the diffraction patterns
    """
    batchsize = calculated_patterns.shape[0]
    device = calculated_patterns.device
    dtype = calculated_patterns.dtype
    noise_std = torch.rand((batchsize,1), device=device, dtype=dtype) * (noise_max - noise_min) + noise_min
    noise = torch.randn(calculated_patterns.shape, device=device, dtype=dtype) * noise_std
    return noise

def get_background(batchsize, data, bg_prm_max=0.025, bg_prm_min=0.0, degree=8):
    """Calculate background profiles using Chebyshev polynomials

    Args:
        batchsize (int): The batch size
        data (tensor): The twotheta values for the diffraction histograms
        bg_prm_max (float, optional): Maximum value for the standard deviation
            of the weights for the Chebyshev polynomial components. Defaults to
            0.025.
        bg_prm_min (float, optional): Minimum value for the standard deviation
            of the weights for the Chebyshev polynomial components. Defaults to
            0.0.
        degree (int, optional): The degree of Chebyshev polynomial to use for
            the backgrounds. Defaults to 8.

    Returns:
        tensor: Background profiles. Shape = (batch, number of points in histogram)
    """
    device = data.device
    dtype = data.dtype
    n = torch.arange(degree,device=device,dtype=dtype).unsqueeze(1)
    # Scale data into range -1 to +1
    ttstar = 2*(data - data.min())/(data.max() - data.min()) - 1
    chebyshev_basis = torch.cos(n*torch.arccos(ttstar))
    params = (((torch.rand((batchsize,1,1), device=device, dtype=dtype)
                * (bg_prm_max - bg_prm_min)) + bg_prm_min)
                * torch.randn((batchsize, chebyshev_basis.shape[0], 1),
                device=device, dtype=dtype))
    bg = (params * chebyshev_basis).sum(dim=1)
    return bg

A quick test to see if these functions work and generate sensible looking noise and backgrounds:

Show the code
import matplotlib.pyplot as plt

data = torch.linspace(4,44,2048).float()

batchsize = 4
bgs = get_background(batchsize, data, degree=10)
bgs -= bgs.min(dim=-1).values.unsqueeze(1)

noise = get_noise(bgs)

fig, ax = plt.subplots(1,1,figsize=(10,6))
for bg in (bgs+noise):
    ax.plot(data, bg)
ax.set_xlabel("$2\\theta$")
ax.set_title("Randomly generated backgrounds with noise")
plt.show()

Some funky background shapes are possible with randomly generated Chebyshev polynomials! We can also see varying levels of noise being added to the profiles.

We now need some functions to tie everything together and start to generate some data.

Show the code
def get_peak_positions(crystal_systems, hkl, intensities, unit_cells,
                    perturbation_stddev=0.05, zpemin=0.03, zpemax=0.03, wavelength=1.54056):
    batchsize = intensities.shape[0]
    dtype = intensities.dtype
    device = intensities.device

    cell_perturbation = get_unit_cell_perturbation(crystal_systems,
                                    dtype=dtype, stddev=perturbation_stddev)
    new_unit_cells = unit_cells + cell_perturbation
    lattice_matrix, valid = get_lattice_matrix(new_unit_cells)

    # Get rid of any invalid unit cells after perturbation
    if valid.sum() != valid.shape[0]:
        import warnings
        warnings.warn("Invalid cells generated")
        lattice_matrix = lattice_matrix[valid]
        hkl = hkl[valid]
        intensities = intensities[valid]
        batchsize = intensities.shape[0]
    reciprocal_lattice_matrix = torch.linalg.inv(lattice_matrix)
    reciprocal_lattice_metric_tensor = get_recip_lattice_metric_tensor(reciprocal_lattice_matrix)
    d_spacing = get_d_spacing(reciprocal_lattice_metric_tensor, hkl)
    zpe = get_zero_point_error(batchsize, device, dtype, zpemin=zpemin, zpemax=zpemax)
    twotheta = zpe + d_to_tt(d_spacing, wavelength)

    return twotheta, reciprocal_lattice_metric_tensor, hkl, intensities, d_spacing

def get_PO_intensities(hkl, reciprocal_lattice_metric_tensor, dspacing, intensities, PO_std=0.1):
    # Now apply PO perturbation to the peak intensities
    cosP, sinP, MDfactor, PO_axis = get_MD_PO_components(hkl,
                                    reciprocal_lattice_metric_tensor, dspacing, factor_std=PO_std)
    intensities = apply_MD_PO_correction(intensities, cosP, sinP, MDfactor)
    return torch.nan_to_num(intensities)

def get_peak_shape_params(twotheta, U_min=0.0001, U_max=0.0004,
                        V_min=0.0001, V_max=0.0004, W_min=0.0001, W_max=0.0004,
                        Z_min=0.0001, Z_max=0.0004, X_min=0.001, X_max=0.035,
                        Y_min=0.001, Y_max=0.035, shlmax=0.5):
    batchsize = twotheta.shape[0]
    dtype = twotheta.dtype
    device = twotheta.device
    tan_twotheta = torch.tan(twotheta*torch.pi/180.)
    cos_twotheta = torch.cos(twotheta*torch.pi/180.)
    U, V, W, Z = get_UVWZ(batchsize, device, dtype, U_min=U_min, U_max=U_max,
                        V_min=V_min, V_max=V_max, W_min=W_min, W_max=W_max,
                        Z_min=Z_min, Z_max=Z_max)
    X, Y = get_XY(batchsize, device, dtype, X_min=X_min, X_max=X_max,
                        Y_min=Y_min,Y_max=Y_max)
    hwhm_gaussian = get_hwhm_G(tan_twotheta, cos_twotheta, U, V, W, Z)
    hwhm_lorentzian = get_hwhm_L(tan_twotheta, cos_twotheta, X, Y)
    shl = get_shl(batchsize, device, dtype, shlmax=shlmax, rescale=True)
    return hwhm_gaussian, hwhm_lorentzian, shl

def calculate_peaks(x, twotheta, intensities, hwhm_gaussian, hwhm_lorentzian, shl):
    # Build the Gaussian, Lorentzian and FCJ asymmetry components on the peak grid
    peak_G = gaussian(x, torch.zeros_like(twotheta), hwhm_gaussian)
    peak_L = lorentzian(x, torch.zeros_like(twotheta), hwhm_lorentzian)
    peak_FCJ = fcj(x, twotheta, shl)
    # Convolve the three components by multiplying their Fourier transforms
    peak_GLF = torch.stack([peak_G, peak_L, peak_FCJ], dim=1)
    prod_FT_GLF = torch.fft.fft(peak_GLF).prod(dim=1)
    peak_voigt = torch.fft.ifft(prod_FT_GLF).real
    # Where the FCJ contribution reduces to a single point (sum == 1), re-centre the convolved peak
    zero_sum = peak_FCJ.sum(dim=-1) == 1
    peak_voigt[zero_sum.squeeze()] = torch.fft.ifftshift(peak_voigt[zero_sum.squeeze()], dim=-1)
    # Normalise each peak to unit maximum, then scale by its intensity
    peak_voigt /= peak_voigt.max(dim=2).values.unsqueeze(2)
    peak_voigt *= intensities
    return peak_voigt

def calculate_full_patterns(x, full_data, twotheta, peak_voigt, ttmin=4., ttmax=44.):
    # Finally calculate the full diffraction pattern
    device = full_data.device
    dtype = full_data.dtype
    # Mask out reflections that fall outside the data range
    twotheta[twotheta == 0] = torch.inf
    twotheta[twotheta < ttmin] = torch.inf
    twotheta[twotheta > ttmax] = torch.inf
    # Scatter each peak profile into the full pattern at its 2theta position
    peakidx = torch.abs((x[0] + twotheta) - full_data).min(dim=-1).indices
    full_pattern = torch.zeros(list(peakidx.shape)+[full_data.shape[0]], device=device, dtype=dtype)
    full_pattern = full_pattern.scatter_(2,
                    peakidx.unsqueeze(2) + torch.arange(x.shape[0], device=device),
                    peak_voigt*torch.isfinite(twotheta))

    # Sum the contributions from all peaks, normalise, and crop to the data range
    full_pattern = full_pattern.sum(dim=1)
    full_pattern /= full_pattern.max(dim=1).values.unsqueeze(1)
    full_pattern = full_pattern[:,(full_data >= ttmin) & (full_data <= ttmax)]
    return full_pattern

def calculate_diffraction_patterns(x, full_data, crystal_systems, hkl, intensities, unit_cells,
                                wavelength=1.54056, ttmin=4., ttmax=44.):
    """
    Expects the input tensors to have a first dimension of size batchsize
    """
    # Peak positions, with unit cell perturbation and zero-point error applied
    twotheta, reciprocal_lattice_metric_tensor, hkl, intensities, d_spacing = get_peak_positions(
                    crystal_systems, hkl, intensities, unit_cells,
                    perturbation_stddev=0.05, zpemin=-0.03, zpemax=0.03, wavelength=wavelength)

    twotheta = twotheta.unsqueeze(2)

    # Peak intensities, with preferred orientation applied
    mod_intensities = get_PO_intensities(hkl, reciprocal_lattice_metric_tensor, d_spacing, intensities).unsqueeze(2)

    # Peak shapes
    hwhm_gaussian, hwhm_lorentzian, shl = get_peak_shape_params(twotheta)
    peak_voigt = calculate_peaks(x, twotheta, mod_intensities, hwhm_gaussian, hwhm_lorentzian, shl)

    # Combine the peaks into full patterns, then add background and noise
    calculated_patterns = calculate_full_patterns(x, full_data, twotheta, peak_voigt, ttmin=ttmin, ttmax=ttmax)

    bgs = get_background(calculated_patterns.shape[0],
                        full_data[(full_data >= ttmin) & (full_data <= ttmax)],
                        degree=10)
    noise = get_noise(calculated_patterns)

    calculated_patterns += bgs + noise
    # Rescale each pattern to the range [0, 1]
    calculated_patterns -= calculated_patterns.min(dim=1).values.unsqueeze(1)
    calculated_patterns /= calculated_patterns.max(dim=1).values.unsqueeze(1)

    return calculated_patterns

Let’s test the functions we’ve written using some real diffraction data as a comparison point. We’ll first plot the actual diffraction pattern, then plot several simulated datasets calculated from the unit cell, hkls and intensities extracted by DASH. This will show the effect of the data augmentation we’ve added to the code, which modulates the positions, intensities and shapes of the peaks.

The dataset we’ll look at is that for verapamil hydrochloride. We’ll also need a function to calculate the Lorentz-polarisation factor, as the intensities stored in the DASH .hcv file have already been corrected for it.

Show the code
v_xye = []
with open("./files/Verap.xye", "r") as f:
    for line in f:
        line = list(filter(None, line.strip().split(" ")))
        v_xye.append([float(x) for x in line])
v_xye = np.array(v_xye)
v_tt, v_i_obs = v_xye[:,0], v_xye[:,1]
v_i_obs /= v_i_obs.max()
Show the code
def lp_factor(twotheta, ttmono=torch.tensor([28])):
    # Lorentz-polarisation factor for a diffractometer with a monochromator
    # at 2theta = ttmono; both angles are in degrees
    A = torch.cos(ttmono*torch.pi/180)**2
    cossqdtt = torch.cos(twotheta*torch.pi/180)**2
    sintt = torch.sin(twotheta*torch.pi/180)
    return (1+A*cossqdtt) / ((1+A)*sintt)
Show the code
import matplotlib.pyplot as plt
device = torch.device("cpu")
dtype = torch.float32
ttmin = 4
ttmax = 44
peakrange = 3.
datadim = 2048
full_data = torch.linspace(ttmin-(peakrange/2), ttmax+(peakrange/2), int(np.ceil((ttmax-ttmin+peakrange)/((ttmax-ttmin)/datadim))), device=device, dtype=dtype)
plotdata = full_data[(full_data >= ttmin) & (full_data <= ttmax)].cpu()
x = (full_data[full_data <= ttmin+(peakrange/2)]).clone() - ttmin

v_hkl = []
v_i = []
with open("./files/Verap.hcv", "r") as f:
    for line in f:
        line = list(filter(None, line.strip().split(" ")))
        v_hkl.append([int(x) for x in line[0:3]])
        v_i.append(float(line[3]))
v_hkl = torch.tensor(v_hkl).float()
v_i = torch.tensor(v_i).float()
v_cell = torch.tensor([7.08991, 10.59464, 19.20684, 100.1068, 93.7396, 101.5610])
v_cs = torch.tensor([cs_key["Monoclinic"]])

repeat = 4
diffraction_patterns = calculate_diffraction_patterns(x, full_data, v_cs.repeat(repeat), v_hkl.repeat(repeat,1,1), v_i.repeat(repeat,1), v_cell.repeat(repeat,1))
plt.plot(v_tt[v_tt <= 35], v_i_obs[v_tt <= 35])
plt.title("Original verapamil hydrochloride diffraction data")
plt.xlabel("$2\\theta$")
plt.ylabel("Intensity")
plt.show()
fig, ax = plt.subplots(2,2,figsize=(16,8))
for i, p in enumerate(diffraction_patterns):
    col = i % 2
    row = i // 2
    corrected_i = (lp_factor(plotdata[plotdata <= 35])*p[plotdata <= 35]).squeeze()
    corrected_i -= corrected_i.min()
    corrected_i /= corrected_i.max()
    ax[row][col].plot(plotdata[plotdata <= 35], corrected_i)
    ax[row][col].set_xlabel("$2\\theta$")
    ax[row][col].set_ylabel("Intensity")
fig.suptitle("Calculated diffraction patterns")
plt.show()

The calculated patterns clearly resemble the observed diffraction data, whilst also having subtle differences in the background, noise levels, peak shapes, peak asymmetries as well as the peak positions and intensities.
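
If we wanted something more quantitative than an eyeball comparison, one option is to interpolate each calculated pattern onto the observed 2θ grid and compute a simple similarity score. The snippet below is a rough sketch rather than part of the workflow above; it reuses the tensors defined in the previous cells, and because the random backgrounds are included in the calculated profiles the absolute numbers should be treated as a rough guide only.

import numpy as np

def cosine_similarity(obs_tt, obs_i, calc_tt, calc_i):
    # Interpolate the calculated profile onto the observed 2theta grid, then
    # take the cosine similarity of the two unit-normalised profiles
    calc_on_obs = np.interp(obs_tt, calc_tt, calc_i)
    obs = obs_i / np.linalg.norm(obs_i)
    calc = calc_on_obs / np.linalg.norm(calc_on_obs)
    return float(np.dot(obs, calc))

mask = v_tt <= 35
for i, p in enumerate(diffraction_patterns):
    # Apply the same Lorentz-polarisation correction used for plotting
    corrected_i = (lp_factor(plotdata) * p).squeeze().numpy()
    sim = cosine_similarity(v_tt[mask], v_i_obs[mask],
                            plotdata.numpy(), corrected_i)
    print(f"Calculated pattern {i+1}: cosine similarity = {sim:.3f}")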

Performance

Let’s test the performance of the generation code on an Nvidia Quadro RTX 5000 GPU (16 GB VRAM). We’ve got about 288,000 patterns to generate - how long will it take?!

Show the code
data_dir = "./"
base_name = "4-44-CuKa1-data_4000_"
crystal_systems = torch.tensor(np.load(data_dir+base_name+"crystal_systems_numeric.npy"))
hkl             = torch.tensor(np.load(data_dir+base_name+"hkl.npy")).float()
intensities     = torch.tensor(np.load(data_dir+base_name+"intensities.npy")).float()
unit_cell       = torch.tensor(np.load(data_dir+base_name+"unit_cell.npy")).float()
cod_id         = np.load(data_dir+base_name+"cod_id.npy")
Show the code
from tqdm import tqdm

device = torch.device("cuda:0")
dtype = torch.float32
ttmin = 4
ttmax = 44
peakrange = 3.
datadim = 2048
full_data = torch.linspace(ttmin-(peakrange/2), ttmax+(peakrange/2), int(np.ceil((ttmax-ttmin+peakrange)/((ttmax-ttmin)/datadim))), device=device, dtype=dtype)
plotdata = full_data[(full_data >= ttmin) & (full_data <= ttmax)]
x = (full_data[full_data <= ttmin+(peakrange/2)]).clone() - ttmin

batchsize = 1000
idx = torch.arange(batchsize)

for i in tqdm(range(intensities.shape[0]//batchsize)):
    batch_idx = idx + (i*batchsize)
    diffraction_patterns = calculate_diffraction_patterns(
                                        x,
                                        full_data,
                                        crystal_systems[batch_idx].to(device),
                                        hkl[batch_idx].to(device),
                                        intensities[batch_idx].to(device),
                                        unit_cell[batch_idx].to(device)
                                    )
100%|██████████| 288/288 [00:57<00:00,  4.97it/s]

288,000 patterns generated in under a minute - 288 batches of 1,000 patterns in roughly 57 seconds works out at around 5,000 diffraction patterns per second!

I’m sure there are some tweaks that would boost the performance even more, but for now this will be good enough for initial experiments with neural networks. A quick check with a profiler could give some hints as to where the biggest bottlenecks in the code are:

Show the code
import cProfile as profile
import pstats

prof = profile.Profile()
prof.enable()
diffraction_patterns = calculate_diffraction_patterns(
                                        x,
                                        full_data,
                                        crystal_systems[batch_idx].to(device),
                                        hkl[batch_idx].to(device),
                                        intensities[batch_idx].to(device),
                                        unit_cell[batch_idx].to(device)
                                    )
prof.disable()
stats = pstats.Stats(prof).strip_dirs().sort_stats("tottime")
stats.print_stats(10) # top 10 rows
         294 function calls in 0.422 seconds

   Ordered by: internal time
   List reduced from 94 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.297    0.297    0.298    0.298 118587661.py:54(calculate_peaks)
        1    0.086    0.086    0.086    0.086 118587661.py:67(calculate_full_patterns)
        4    0.016    0.004    0.016    0.004 {method 'to' of 'torch._C._TensorBase' objects}
        1    0.015    0.015    0.016    0.016 1250221022.py:1(get_MD_PO_components)
        1    0.002    0.002    0.422    0.422 3294103451.py:6(<module>)
        1    0.002    0.002    0.004    0.004 118587661.py:1(get_peak_positions)
        1    0.000    0.000    0.000    0.000 {built-in method torch._C._linalg.linalg_inv}
        1    0.000    0.000    0.001    0.001 895306903.py:3(get_unit_cell_perturbation)
        1    0.000    0.000    0.405    0.405 118587661.py:82(calculate_diffraction_patterns)
        1    0.000    0.000    0.001    0.001 3032108489.py:41(fcj)

<pstats.Stats at 0x1b7da6a57c0>

The filenames look a bit odd as I’m running this in a notebook. However, it looks like the peak calculation is taking the most time, which isn’t too surprising to me as there’s a lot going on there! Perhaps in future work I’ll try to improve the efficiency of the code. For now though, I think it’s good enough for us to start experimenting with training neural networks, which will be the subject of future posts.
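
One caveat worth noting: CUDA kernels launch asynchronously, so a purely CPU-side profiler like cProfile can attribute GPU time to whichever operation happens to force a synchronisation. For a more detailed breakdown, PyTorch’s built-in profiler records CUDA kernel times as well as CPU time. The snippet below is a rough sketch (reusing the same batch tensors as above) rather than something that was run for this post.

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    diffraction_patterns = calculate_diffraction_patterns(
                                        x,
                                        full_data,
                                        crystal_systems[batch_idx].to(device),
                                        hkl[batch_idx].to(device),
                                        intensities[batch_idx].to(device),
                                        unit_cell[batch_idx].to(device)
                                    )

# Sort by total time spent on the GPU to see which operations dominate
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))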

PowCod generated data

As a final sense check, let’s see some of the data we can generate from the PowCod database.

Show the code
idx = torch.arange(4)
diffraction_patterns = calculate_diffraction_patterns(
                                        x,
                                        full_data,
                                        crystal_systems[idx].to(device),
                                        hkl[idx].to(device),
                                        intensities[idx].to(device),
                                        unit_cell[idx].to(device)
                                    )
fig, ax = plt.subplots(2,2,figsize=(16,10))
for i, d in enumerate(diffraction_patterns):
    col = i % 2
    row = i // 2
    ax[row][col].plot(plotdata.cpu().numpy(), d.cpu().numpy())
    ax[row][col].set_xlabel("$2\\theta$")
    ax[row][col].set_ylabel("Intensity")
    ax[row][col].set_title(f"{i+1} - COD entry {cod_id[idx[i]]}")
    print(f"{i+1} - http://www.crystallography.net/cod/{cod_id[idx[i]]}.html")
plt.show()
1 - http://www.crystallography.net/cod/7220223.html
2 - http://www.crystallography.net/cod/1553394.html
3 - http://www.crystallography.net/cod/7037807.html
4 - http://www.crystallography.net/cod/7036591.html

I downloaded the CIFs for each of the COD entries, and calculated their diffraction patterns using Mercury. I am very pleased that the patterns produced by the code in this post match the patterns produced by Mercury!
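
For anyone without access to Mercury, a scriptable cross-check is also possible. As a rough sketch (not something used in this post), pymatgen can calculate a stick pattern directly from a downloaded CIF; the filename below is just a placeholder for one of the COD entries above.

from pymatgen.core import Structure
from pymatgen.analysis.diffraction.xrd import XRDCalculator
import matplotlib.pyplot as plt

# Placeholder filename - substitute the path to a CIF downloaded from the COD
structure = Structure.from_file("7220223.cif")
calc = XRDCalculator(wavelength="CuKa1")
pattern = calc.get_pattern(structure, two_theta_range=(4, 44))

plt.vlines(pattern.x, 0, pattern.y / pattern.y.max())
plt.xlabel("$2\\theta$")
plt.ylabel("Relative intensity")
plt.title("Stick pattern calculated with pymatgen")
plt.show()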

Conclusions

In this post, we’ve developed some code to allow us to read in unit cells, Miller indices, crystal system information and intensities, and rapidly generate realistic looking powder X-ray diffraction data on a GPU.

I’ve put all the functions written in this code into a github repository which can be found here - https://github.com/mspillman/powcodgen/ - please feel free to submit issues, pull requests etc if you have any ideas or suggestions for improvement!

In future posts, I’ll show how such data can be used to train artificial neural networks to do lots of interesting and useful things to help crystallographers understand their data.
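
As a rough preview of what that might look like, the sketch below shows one way the generator could be dropped straight into a training loop, so that every batch of patterns is created on the GPU as it is needed. It reuses the tensors loaded in the performance section above, and model, optimiser and loss_fn are hypothetical placeholders rather than anything defined in this post.

import torch

def sample_batch(batchsize=256):
    # Pick a random subset of PowCod entries and generate their patterns on the fly
    batch_idx = torch.randint(0, intensities.shape[0], (batchsize,))
    patterns = calculate_diffraction_patterns(
                        x,
                        full_data,
                        crystal_systems[batch_idx].to(device),
                        hkl[batch_idx].to(device),
                        intensities[batch_idx].to(device),
                        unit_cell[batch_idx].to(device)
                    )
    # Use the crystal system as the label, purely as an example task
    labels = crystal_systems[batch_idx].to(device).long()
    return patterns.unsqueeze(1), labels  # add a channel dimension for a 1D CNN

# Hypothetical training loop - model, optimiser and loss_fn are placeholders
# for step in range(n_steps):
#     patterns, labels = sample_batch()
#     loss = loss_fn(model(patterns), labels)
#     optimiser.zero_grad()
#     loss.backward()
#     optimiser.step()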