5,000,000,000,000,000 bytes from Villigen to Lugano
When tiny structures are investigated at the large-scale research facilities of the Paul Scherrer Institute PSI, huge amounts of data accumulate. This data is archived at the CSCS supercomputing centre in Lugano. One of the supercomputers located there is "Piz Daint" – researchers use it for their simulations and modelling.
At the X-ray free-electron laser SwissFEL in Villigen, a tiny protein crystal in a toothpaste-like mass flows slowly out of an injector. A laser hits it and sets off movements in the molecule. It changes its structure – a bit like a cat arching its back. One-trillionth of a second later, a pulse of X-ray light penetrates the sample and hits a detector. With that, the structural change in the protein is recorded, as it were, photographically. The protein thus imaged is photosensitive rhodopsin, which occurs for example in the retina of the human eye. Its structural change is the starting point for the transmission of light stimuli to the brain.
In the experimental setup, 25 X-ray light pulses per second hit the protein crystal in the viscous mass. The pulses last only a femtosecond, one-quadrillionth of a second, and they have an extremely high density of photons. This allows high-resolution imaging of molecular structures. In the end, the many individual images create a kind of flipbook movie of the protein's movements. "Filming at this precision makes the mountain of data grow massively", says Leonardo Sala, head of High-Performance Computing at PSI. The imaging of the rhodopsin protein crystals alone delivered a raw data volume of around 250 terabytes – about a thousand times the storage capacity of a typical commercially available laptop.
It's not only at SwissFEL, but also at other large-scale research facilities such as the Swiss Light Source SLS or the neutron source SINQ, that advances in accelerator and detector technology lead to boosts in performance that in turn generate still more data when experiments are carried out. PSI currently produces up to five petabytes of data annually. That corresponds to roughly the storage capacity of one million DVDs.
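The two storage comparisons above can be checked with simple arithmetic. The laptop and DVD capacities used here (250 gigabytes and 4.7 gigabytes) are plausible assumptions, not figures stated in the text:

```python
GB = 10**9   # gigabyte in bytes
TB = 10**12  # terabyte in bytes
PB = 10**15  # petabyte in bytes

# Assumptions: a typical laptop holds ~250 GB, a single-layer DVD ~4.7 GB.
laptops = 250 * TB / (250 * GB)   # rhodopsin raw data vs. laptop storage
dvds = 5 * PB / (4.7 * GB)        # PSI's annual output vs. DVD capacity

print(f"{laptops:.0f} laptops, {dvds:.0f} DVDs")  # 1000 laptops; roughly one million DVDs
```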
Where to go with so much data?
PSI's computing centre is not designed for such quantities of data. Since 2018, therefore, the data has been archived at the supercomputing centre Centro Svizzero di Calcolo Scientifico (CSCS) in Lugano. The so-called petabyte archive was developed in close collaboration between colleagues at PSI and CSCS. Computer experts from the two institutions devised a management process through which the digital information can be compressed, securely transmitted, archived, retrieved, and finally deleted once the archiving period of at least five years has elapsed. A dedicated fibre-optic link between PSI and CSCS transfers the data at ten gigabytes per second.
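Taking the quoted link speed of ten gigabytes per second at face value, a rough estimate shows how long moving an experiment's raw data to Lugano would take – for instance the 250 terabytes from the rhodopsin measurements:

```python
GB = 10**9   # gigabyte in bytes
TB = 10**12  # terabyte in bytes

link_rate = 10 * GB        # PSI-CSCS fibre link: 10 gigabytes per second
dataset = 250 * TB         # e.g. the rhodopsin raw data

hours = dataset / link_rate / 3600
print(f"{hours:.1f} hours")  # about 7 hours at full link speed
```

In practice the effective rate would be lower than the nominal link speed, so this is a lower bound.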
No one expects this flood of data to end. With the upgrade of SLS to SLS 2.0, a great many more bits and bytes will be produced in the future. "We are currently working on a procedure to reduce and compress this volume of data", Sala says. Special algorithms are designed to sort the data coming from the detectors so that only information relevant to research will be stored. "During the measurement of proteins at SLS, less than 20 percent of the X-ray pulses hit a protein and produce a usable image." Signals that will not yield a result do not require resource-intensive storage.
What sounds so simple is, in reality, an enormous challenge. "Teaching a computer to distinguish which measurements are unusable is very difficult", Sala admits. But that is only the first step towards curbing the glut of data. After automated sorting, IT specialists can achieve a tenfold reduction of data volume by storing not raw data, but only information that has been processed for its end use.
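In its simplest conceivable form, the sorting step Sala describes amounts to a threshold filter over detector readouts: frames whose total signal is too low to contain a crystal hit are discarded before storage. The synthetic counts, the 20 percent hit rate, and the cutoff below are illustrative assumptions, not PSI's actual algorithm:

```python
import random

random.seed(0)

# Synthetic total detector counts per X-ray pulse. Roughly 20% of pulses
# actually hit a crystal (per the SLS figure quoted above) and give a
# much stronger signal; the count levels themselves are made up.
frames = [random.gauss(100, 10) for _ in range(80)]   # empty shots
frames += [random.gauss(600, 50) for _ in range(20)]  # crystal hits

threshold = 300                                       # assumed cutoff
kept = [f for f in frames if f > threshold]

print(f"stored {len(kept)} of {len(frames)} frames")
```

The hard part, as Sala notes, is choosing that decision rule reliably for real, noisy detector data rather than for a toy distribution like this one.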
Activating the robot in Lugano from Villigen
At CSCS in Lugano, the protein research group's measurements are stored in a so-called tape library. Held in a rack are around 3,600 data tapes similar to the magnetic tapes that were used, decades ago, in videocassettes. "To start with, we have ten petabytes of storage available in the tape library. The big advantage of the collaboration with CSCS is that we can upgrade it as needed", Sala says. By 2022, PSI plans to transfer around 85 petabytes to CSCS for archiving.
Storing data is one thing; retrieving it from the archive is something completely different. That is why a catalogue specially developed for the purpose lists where specific information can be found. Researchers can simply browse through this catalogue as needed and remotely activate, from Villigen, a robot that picks out the appropriate tapes, puts them into a computer drive, and initiates the data transfer to PSI.
The collaboration with CSCS, however, goes far beyond the pure archiving of research results. "We've been using the supercomputer at CSCS for 15 years", says Andreas Adelmann, who heads an accelerator modelling and advanced simulations group at PSI. That's because the researchers need enormously high computing power for simulations and modelling of large-scale research facilities and experiments, for example in materials and life sciences. They find this in Piz Daint at CSCS, one of the most powerful supercomputers in the world. In 1941, the first practical, freely programmable computer, the Z3, barely carried out two additions per second. Today, Piz Daint reaches around 25 petaflops – 25 quadrillion arithmetic operations per second, some 14,000 times faster than a PlayStation 4 graphics card.
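The performance comparisons can be reproduced with a few lines of arithmetic. The PlayStation 4 GPU figure of about 1.84 teraflops is an assumption consistent with the "14,000 times" claim, not a number given in the text:

```python
Z3_OPS = 2            # Z3 (1941): about two additions per second
PIZ_DAINT = 25e15     # 25 petaflops = 25 quadrillion operations per second
PS4_GPU = 1.84e12     # assumed ~1.84 teraflops for a PS4 graphics card

print(f"{PIZ_DAINT / PS4_GPU:,.0f}x a PS4 GPU")   # roughly 14,000x
print(f"{PIZ_DAINT / Z3_OPS:.2e}x the Z3")        # more than 10 quadrillion times
```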
In principle, modelling and simulation are required for nearly all research at PSI, whether to understand how cracks propagate in materials or to study fuel cell components.
Simulations help not only in designing new particle accelerators such as the cyclotron-based proton accelerator, SLS, or SwissFEL, but also in further developing and optimising existing ones. In addition, researchers can calculate in advance how an experiment is likely to proceed, in order to identify possible problems in the experimental setup.
And there is yet another reason why researchers gladly and in good conscience send their data to Lugano for calculations and archiving: Since 2013, Piz Daint has been the most cost-effective and energy-efficient petaflop supercomputer in the world, since no energy-intensive refrigeration system is consuming electricity to keep it cool. The water of Lake Lugano prevents the electronic superbrain at CSCS from running hot. Cold water at around 6 degrees Celsius is drawn from a depth of 45 metres and then, after use, returned to the lake at a depth of 12 metres. In the process, the potential energy of the water due to this height difference is also used, with the help of turbines, to generate electricity.
Text: Christina Bonanati