Numerical simulations play a critical role in modern scientific research. Theory department scientists develop and apply state-of-the-art numerical codes for solving outstanding problems in fusion energy science and plasma-based technology. These calculations are extremely demanding and require the use of massively parallel computers. Experts from the Computational Plasma Physics Group (CPPG) work in close collaboration with research physicists to develop and implement modern computational techniques and optimizations that allow these numerical applications to run with high performance and parallel efficiency on the leading supercomputers. Advanced visualization of simulation and experimental results is also a very important aspect of the scientific discovery process and CPPG provides support and expertise in this field as well.

Modern computer architectures are highly hierarchical, consisting of several levels of memory hardware of different sizes and speeds to supply data to the powerful many-core processors that execute the numerical instructions of the code. A whole system is composed of compute nodes linked together by a high-speed interconnect that transmits the data that need to be shared between nodes during a parallel computation. Keeping tens or hundreds of thousands of cores busy at all times is a very difficult task that requires a deep understanding of parallel computing, hardware, programming models, algorithms, compilers, optimizations, communication technologies, etc.

The CPPG experts in high performance computing and applied mathematics develop and maintain all computational aspects of PPPL's flagship codes. The methods employed cover, among others: parallel PDE solvers using the PETSc library, shared memory parallelism with OpenMP, distributed parallelism with MPI, GPU programming with CUDA and OpenACC, and other state-of-the-art algorithms tailored to each specific application. CPPG members collaborate with computer scientists, applied mathematicians, and visualization experts from other institutions to develop new algorithms and methods for PPPL's codes, with the goal of achieving the highest performance on new and future architectures.

The race for exascale computing brings about a whole new challenge in terms of unprecedented concurrencies (> 1 billion compute threads) and change in programming models. It is important to keep up with the technology and continuously update our computational applications. This allows researchers to simulate new parameter regimes that were previously inaccessible. When combined with advanced data analysis and visualization, these capabilities lead to new scientific discoveries.

Thousands of scientists across all continents collaborate on researching and developing Magnetic Fusion Energy. They perform experiments on large plasma confinement devices, share and distribute data, and run numerical simulations to explore, understand, and predict the behavior of fusion plasmas. At experiments, increasingly sophisticated plasma diagnostics probe fusion plasmas with ever increasing spatial and temporal resolution. While this allows scientists to better understand the complex multi-scale physics phenomena that occur within a plasma, this comes at a cost of ever larger datasets. Today's fusion experiments generate several Gigabytes of measurement data per second. Moving to the ITER [1] scale, several thousands Gigabytes of measurement data will be produced on a given day. Analyzing such large data with traditional data analysis methods turns out to be challenging.

In collaboration with experimentalists, computational scientists, and HPC engineers, PPPL researchers are developing a federated data analysis framework that is performant enough to reliably transport high velocity data streams around the globe, flexible enough to facilitate the wide range of workflows that occur in fusion energy research and adaptable to automatically adjust to network congestion and data anomalies. In its current state, DELTA, the aDaptive nEar-reaL Time Analysis framework allows to consistently stream measurement data from KSTAR in Korea to Cori, to Cori, a Cray XC-40 supercomputer at the National Energy Energy Research Scientific Computing (NERSC) Centre in California with over 500 MByte/sec. There, Delta has been used to facilitate a standard spectral data analysis workflow in less than five minutes on the several Gigabyte large dataset, a runtime more than 100 times faster than when performed on a standard single-core implementation [3]. The figure illustrates how Delta uses the ADIOS2 I/O system to orchestrate workflows on distributed resources and how it connects to data visualization dashboards.

[1] ITER, (ITER website)

[2] R. Kube, M. Churchill

[3] W. F. Godoy, N. Podhorszki

To study magnetically-confined plasmas in next-generation tokamaks, such as ITER [1], with so-called “total-$f$”, gyrokinetic codes, such as XGC [2], it is crucial that the codes scale well in the “weak” scaling sense (i.e. for a bigger problem size with proportionally more particles).
The gyrokinetic equation [2] is
$df/dt$ $=$ $\partial f/\partial t$ $+$ $\dot{\bf z} \cdot \partial f / d {\bf z}$ $+$ $C(f) + S(f)$,
where $f({\bf z},t)$ is the particle-distribution function, $t$ is time, and ${\bf z}$ the vector of position and momenta; $C(f)$ is a collision operator; and $S(f)$ describes sources/sinks. (Total-$f$ codes solve this *without* the scale-separation assumption leading to so-called “$\delta f$” codes.)
XGC scales *excellently* on leadership-class computers, such as Titan, the Department of Energy's (DoE) supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), the fastest, open-science computer in USA, with a peak performance of 27 *quadrillion*, floating-point operations/second!
The figure (click to enlarge) shows nearly-ideal, weak scaling to the *maximal*, heterogeous Titan capability, using 1 message-passing interface (MPI) and 16 OpenMP threads/node (black line).

Close collaboration between applied mathematicians and computer scientists since the operation of Titan (in 2013) have significantly improved XGC's scalability.
The purple line is the imperfect, MPI-only scaling achieved in the early Titan, with a limited number of compute nodes; and the green line shows how the 16 OpenMP threadings (one thread/core) improved the scaling performance.
The red line shows that the scaling becomes *near perfect* all the way to the maximal node number by utilizing 16 OpenMP threads/node.
The black line shows that the computing speed becomes 2.4$\times$ faster when the graphical-processing units (GPUs) are utilized together with the CPUs, still keeping the near-perfect scalability.
The excellent scaling properties of XGC opens new areas of research into magnetically-confined plasmas that would otherwise be impossible; such as global, nonlinear, “blobby” ITER edge turbulence studies, in realistic diverted geometry, with gyrokinetic ions, drift kinetic electrons and neutral particle recycling [3].

[1] ITER, (ITER website)

[2] S. Ku, R. Hager

[3] C.S. Chang, 2016 IAEA Fusion Eenergy Conference, Invited Talk

Ever-increasingly, modern scientific advances are facilitated by computational firepower; and exploiting advances in both computational hardware *and* software is essential for continued progress in fusion energy research.
For example: fusion reactor size and cost is determined by the balance between loss processes due to turbulent transport and self-heating rates from fusion reactions, and these can only be determined computationally.
Since 1998, the “GTC” gyro-kinetic, particle-in-cell code [0] has been at the forefront of exploiting increasingly advanced computational capabilities, and was the first Fusion Energy Science (FES) code to deliver production run simulations at the tera-flop level in 2002, and peta-flop in 2009!

Hardware | #cpus | teraflops | #P | #time | Discovery | |

1998 | Cray T3E | $10^2$ | $10^{-1}$ | $10^8$ | $10^4$ | ion-turbulence zonal flow [1]; |

2002 | IBM SP | $10^3$ | $10^{0}$ | $10^{9}$ | $10^4$ | ion-transport size scaling [2]; |

2007 | Cray XT3/4 | $10^4$ | $10^{2}$ | $10^{10}$ | $10^5$ | electron turbulence [3]; EP transport [4] |

2009 | Cray XT5 | $10^5$ | $10^{3}$ | $10^{10}$ | $10^5$ | electron transport scaling [5]; EP-driven modes; |

2012- | Cray XK7/Titan | $10^5$ | $10^{4}$ | $10^{11}$ | $10^5$ | kinetic-MHD [6]; turbulence+EP+MHD; TAE [7]; |

2018 | Exascale | ? | $10^{6}$ | $10^{12}$ | $10^6$ | turbulence, EP, MHD, RF; |

At every stage, algorithmic advances were required to achieve the new level of performance and scalability, with increasing number of particles (#P), for increasing number of time-steps (#time), that enabled each scientific discovery: 1-D domain decomposition with Message Passing Interface (MPI) enabled study of ion-turbulence zonal flows [1]; loop-level shared memory parallelism with OpenMP enabled ion-transport size scaling up to an ITER-size tokamak [2]; and implementation of an MPI particle distribution, a PETSc-based parallel solver, and GPU algorithms with CUDA enabled the other scientific advances in energetic particle (EP) physics, toroidal Alfvén eigenmodes (TAE) physics, etc. as shown in the table. Future progress, into the exascale realm, will require algorithmic and solver advances (e.g. 2-D domain decomposition) which will only be realized in an interdisciplinary, “co-design” environment working with computer scientists and applied mathematicians.

[0] Z. Lin, T.S. Hahm

[1] Z. Lin, S. Ethier

[2] Z. Lin, T.S. Hahm

[3] Z. Lin, I. Holod

[4] Wenlu Zhang (张文禄), Zhihong Lin (林志宏) & Liu Chen (陈骝), Phys. Rev. Lett.

[5] Yong Xiao & Zhihong Lin, Phys. Rev. Lett.

[6] Zhixuan Wang (王直轩), Zhihong Lin (林志宏)

[7] H.S. Zhang (张桦森), Z. Lin (林志宏) and I. Holod, Phys. Rev. Lett.