GLORIA

GEOMAR Library Ocean Research Information Access

  • GEOMAR Catalogue / E-Books  (1)
    Milton: CRC Press LLC
    Keywords: Science--Data processing ; Technology--Data processing ; Exascale computing ; Electronic books
    Description / Table of Contents: This book focuses on the development of scalable and performance-portable scientific applications for future exascale computer systems. It centers on the programming practices application developers use to achieve scalability on high-end computer systems while maintaining architectural and performance portability across different computer technologies. (A brief illustrative sketch of this directive-based programming style follows the contents note below.)
    Type of Medium: Online Resource
    Pages: 1 online resource (607 pages)
    Edition: 1st ed.
    ISBN: 9781351999236
    Series Statement: Chapman and Hall/CRC Computational Science Series
    DDC: 502.85
    Language: English
    Note: Cover -- Half Title Page -- Series Page -- Title Page -- Copyright Page -- Contents -- Foreword -- Preface -- About the Editors -- Contributors
    Chapter 1: Portable Methodologies for Energy Optimization on Large-Scale Power-Constrained Systems -- 1.1 Introduction -- 1.2 Background: How Architectures Drive the ASET Approach -- 1.3 The ASET Approach -- 1.3.1 Optimizing Per-Core Energy -- 1.3.2 Optimizing Power Allocation across a Parallel System -- 1.4 ASET Implementation -- 1.4.1 Example: Wave-Front Algorithms -- 1.4.2 Example: Load-Imbalanced Workloads -- 1.5 Case Study: ASETs versus Dynamic Load Balancing -- 1.5.1 Power Measurements and Analysis -- 1.6 Conclusions -- References
    Chapter 2: Performance Analysis and Debugging Tools at Scale -- 2.1 Introduction -- 2.2 Tool and Debugger Building Blocks -- 2.2.1 Hardware Performance Counters -- 2.2.2 Sampling -- 2.2.2.1 Event-Based Sampling -- 2.2.2.2 Instruction-Based Sampling -- 2.2.2.3 Data-Centric Sampling -- 2.2.3 Call Stack Unwinding -- 2.2.4 Instrumentation -- 2.2.4.1 Source-Code Instrumentation -- 2.2.4.2 Compiler-Based Instrumentation -- 2.2.4.3 Binary Instrumentation -- 2.2.5 Library Interposition -- 2.2.6 Tracing -- 2.2.7 GPU Performance Tools and Interfaces -- 2.2.8 MPI Profiling, Tools, and Process Acquisition Interfaces -- 2.2.9 OMPT: A Performance Tool Interface for OpenMP -- 2.2.10 Process Management Interface-Exascale -- 2.2.10.1 Architecture and Infrastructure -- 2.2.10.2 Requirements -- 2.3 Performance Tools -- 2.3.1 Performance Application Programming Interface -- 2.3.2 HPCToolkit -- 2.3.3 TAU -- 2.3.4 Score-P -- 2.3.5 Vampir -- 2.3.6 Darshan -- 2.4 Debugging Tools -- 2.4.1 Allinea DDT -- 2.4.2 The TotalView Debugger -- 2.4.2.1 Asynchronous Thread Control -- 2.4.2.2 Reverse Debugging -- 2.4.2.3 Heterogeneous Debugging -- 2.4.2.4 Architecture and Infrastructure -- 2.4.2.5 Multicast and Reduction -- 2.4.2.6 Debugger Requirements -- 2.4.3 Valgrind and Memory Debugging Tools -- 2.4.4 Stack Trace Analysis Tool -- 2.4.5 MPI and Thread Debugging -- 2.5 Conclusions -- References
    Chapter 3: Exascale Challenges in Numerical Linear and Multilinear Algebras -- 3.1 Introduction -- 3.2 Linear Algebra -- 3.2.1 Applications -- 3.2.2 Linear Algebra Operations: State of Practice -- 3.2.2.1 Dense Linear Algebra Operations -- 3.2.2.2 Sparse Linear Algebra Operations -- 3.2.3 Parallel and Accelerated Algorithms -- 3.2.3.1 Hardware Considerations -- 3.2.3.2 Dense Linear Algebra Algorithms -- 3.2.3.3 Sparse Linear Algebra Algorithms -- 3.2.4 Extreme Scale Issues -- 3.2.4.1 Higher Thread Count -- 3.2.4.2 Changing Memory Hierarchies -- 3.2.4.3 Communication Network Developments -- 3.2.4.4 Growing Resilience Concerns -- 3.2.5 Software -- 3.2.5.1 Third-Party Libraries -- 3.2.5.2 Vendor Libraries -- 3.2.6 Conclusion -- 3.3 Tensor Algebra -- 3.3.1 Tensors in Different Scientific Disciplines -- 3.3.2 Basic Tensor Algebra Operations -- 3.3.3 Tensor Decompositions and Higher Level Operations -- 3.3.4 Parallel Algorithms for Basic Tensor Operations -- 3.3.5 Extreme Scale Solutions -- 3.3.5.1 Projected Exascale Computing Hardware Roadmap -- 3.3.5.2 Hardware Abstraction Scheme and Virtual Processing -- 3.3.5.3 HPC Scale Abstraction -- 3.3.5.4 Hierarchical Task-Based Parallelism via Recursive Data Placement and Work Distribution -- 3.4 Conclusions -- Acknowledgment -- References
    Chapter 4: Exposing Hierarchical Parallelism in the FLASH Code for Supernova Simulation on Summit and Other Architectures -- 4.1 Background and Scientific Methodology -- 4.1.1 Type Ia Supernovae -- 4.1.2 Core-Collapse Supernovae -- 4.2 FLASH Algorithmic Details -- 4.2.1 Flash Physics Modules -- 4.2.2 Multiphysics Implementation -- 4.3 Programming Approach -- 4.3.1 Nuclear Burning Module -- 4.3.1.1 OpenMP Threading on Titan -- 4.3.1.2 GPU Optimization on Titan -- 4.3.2 EoS Module -- 4.4 Benchmarking Results -- 4.4.1 Nuclear Burning Module -- 4.4.1.1 OpenMP Threading -- 4.4.1.2 GPU Optimization on Titan -- 4.4.2 EoS Module -- 4.5 Summary -- Acknowledgments -- References
    Chapter 5: NAMD: Scalable Molecular Dynamics Based on the Charm++ Parallel Runtime System -- 5.1 Introduction -- 5.2 Scientific Methodology -- 5.3 Algorithmic Details -- 5.4 Programming Approach -- 5.4.1 Performance and Scalability -- 5.4.1.1 Dynamic Load Balancing -- 5.4.1.2 Topology Aware Mapping -- 5.4.1.3 SMP Optimizations -- 5.4.1.4 Optimizing Communication -- 5.4.1.5 GPU Manager and Heterogeneous Load Balancing -- 5.4.1.6 Parallel I/O -- 5.4.2 Portability -- 5.4.3 External Libraries -- 5.4.3.1 FFTW -- 5.4.3.2 Tcl -- 5.4.3.3 Python -- 5.5 Software Practices -- 5.5.1 NAMD -- 5.5.2 Charm++ -- 5.6 Benchmarking Results -- 5.6.1 Extrapolation to Exascale -- 5.6.1.1 Science Goals -- 5.6.1.2 Runtime System Enhancements Needed -- 5.6.1.3 Supporting Fine-Grain Computations -- 5.6.1.4 Optimizations Related to Wide Nodes -- 5.7 Reliability and Energy-Related Concerns -- 5.7.1 Fault Tolerance -- 5.7.2 Energy, Power, and Variation -- 5.7.2.1 Thermal-Aware Load Balancing -- 5.7.2.2 Speed-Aware Load Balancing -- 5.7.2.3 Power-Aware Job Scheduling with Malleable Applications -- 5.7.2.4 Hardware Reconfiguration -- 5.8 Summary -- Acknowledgments -- References
    Chapter 6: Developments in Computer Architecture and the Birth and Growth of Computational Chemistry -- 6.1 Introduction -- 6.2 Evolution of Computers and Their Use in Quantum Chemistry -- 6.3 Evolution of Quantum Chemistry Programs in the Early Years of Computational Chemistry -- References
    Chapter 7: On Preparing the Super Instruction Architecture and Aces4 for Future Computer Systems -- 7.1 Scientific Methodology -- 7.2 Algorithmic Details -- 7.3 Programming Approach -- 7.3.1 Aces4 and Domain Scientists -- 7.3.2 Aces4 System Development -- 7.3.2.1 Structure of the SIA -- 7.3.2.2 Workers -- 7.3.2.3 Servers -- 7.3.2.4 Load Balancing -- 7.3.2.5 Barriers -- 7.3.2.6 Exploiting GPUs -- 7.4 Scalability -- 7.5 Performance -- 7.6 Portability -- 7.7 External Libraries -- 7.8 Software Practices -- 7.9 Benchmark Results -- 7.10 Other Considerations -- 7.10.1 Fault Tolerance -- 7.11 Conclusion -- Acknowledgments -- References
    Chapter 8: Transitioning NWChem to the Next Generation of Manycore Machines -- 8.1 Introduction -- 8.2 Plane-Wave DFT Methods -- 8.2.1 FFT Algorithm -- 8.2.2 Nonlocal Pseudopotential and Lagrange Multiplier Algorithms -- 8.2.3 Overall Timings for AIMD on KNL -- 8.3 High-Level Quantum Chemistry Methods -- 8.3.1 Tensor Contraction Engine -- 8.3.2 Implementation for the Intel Xeon Phi KNC Coprocessor -- 8.3.3 Benchmarks -- 8.4 Large-Scale MD Methods -- 8.4.1 Domain Decomposition -- 8.4.2 Synchronization and Global Reductions -- 8.4.3 DSLs for Force and Energy Evaluation -- 8.4.4 Hierarchical Ensemble Methods -- 8.5 GAs Parallel Toolkit -- 8.6 Conclusions -- Acknowledgments -- References
    Chapter 9: Exascale Programming Approaches for Accelerated Climate Modeling for Energy -- 9.1 Overview and Scientific Impact of Accelerated Climate Modeling for Energy -- 9.2 GPU Refactoring of ACME Atmosphere -- 9.2.1 Mathematical Considerations and Their Computational Impacts -- 9.2.1.1 Mathematical Formulation -- 9.2.1.2 Grid -- 9.2.1.3 Element Boundary Averaging -- 9.2.1.4 Limiting -- 9.2.1.5 Time Discretization -- 9.2.2 Runtime Characterization -- 9.2.2.1 Throughput and Scaling -- 9.2.3 Code Structure -- 9.2.3.1 Data and Loops -- 9.2.3.2 OpenMP -- 9.2.3.3 Pack, Exchange, and Unpack -- 9.2.3.4 Bandwidth and Latency in MPI Communication -- 9.2.4 Previous CUDA Fortran Refactoring Effort -- 9.2.5 OpenACC Refactoring -- 9.2.5.1 Thread Master Regions -- 9.2.5.2 Breaking Up Element Loops -- 9.2.5.3 Flattening Arrays for Reusable Subroutines -- 9.2.5.4 Loop Collapsing and Reducing Repeated Array Accesses -- 9.2.5.5 Using Shared Memory and Local Memory -- 9.2.5.6 Optimizing the Boundary Exchange for Bandwidth -- 9.2.5.7 Optimizing the Boundary Exchange for Latency -- 9.2.5.8 Use of CUDA MPS -- 9.2.6 Optimizing for Pack, Exchange, and Unpack -- 9.2.7 Testing for Correctness -- 9.3 Nested OpenMP for ACME Atmosphere -- 9.3.1 Introduction -- 9.3.2 Algorithmic Structure -- 9.3.3 Programming Approach -- 9.3.4 Software Practices -- 9.3.5 Benchmarking Results -- 9.4 Portability Considerations -- 9.4.1 Breaking Up Element Loops -- 9.4.2 Collapsing and Pushing If-Statements Down the Callstack -- 9.4.3 Manual Loop Fissioning and Pushing Looping Down the Callstack -- 9.4.4 Kernels versus Parallel Loop -- 9.5 Ongoing Codebase Changes and Future Directions -- Acknowledgments -- References
    Chapter 10: Preparing the Community Earth System Model for Exascale Computing -- 10.1 Introduction -- 10.2 Background -- 10.2.1 CESM -- 10.2.2 Exascale Challenges and Expectations -- 10.3 Scientific Methodology -- 10.3.1 Performance Analysis -- 10.3.2 Kernel Extraction -- 10.3.3 Folding Analysis -- 10.3.4 Ensemble Verification -- 10.3.5 Platforms -- 10.4 Case Study: Dynamical Core -- 10.4.1 Algorithm Details -- 10.4.2 Parallelization Improvements -- 10.4.3 Single-Core Optimization -- 10.4.4 Benchmarking Results -- 10.5 Case Study: Data Analytics -- 10.6 Conclusions and Future Work -- Acknowledgments -- References
    Chapter 11: Large Eddy Simulation of Reacting Flow Physics and Combustion -- 11.1 Scientific Methodology -- 11.2 Algorithmic Details -- 11.3 Programming Approach.
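
Illustrative sketch (not from the book): the contents above repeatedly reference directive-based refactoring for GPUs and multicore CPUs (for example, the OpenACC "Loop Collapsing" and nested OpenMP sections of Chapter 9). The short C program below is a minimal, hedged example of that programming style; the array names and sizes (a, b, NELEM, NP) are invented for this illustration, and the OpenACC pragma is simply ignored when the code is built without accelerator support, so the same source also runs serially on a plain CPU.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical problem dimensions, chosen only for this sketch. */
    #define NELEM 1024   /* number of elements  */
    #define NP      16   /* points per element  */

    int main(void) {
        double *a = malloc(sizeof *a * NELEM * NP);
        double *b = malloc(sizeof *b * NELEM * NP);
        if (!a || !b) return 1;

        for (int i = 0; i < NELEM * NP; ++i) { a[i] = 1.0; b[i] = 2.0; }

        /* Collapse the element and point loops into one parallel iteration
           space, in the spirit of "Loop Collapsing and Reducing Repeated
           Array Accesses" in Chapter 9; with an OpenACC compiler this loop
           can be offloaded to a GPU, otherwise it runs serially. */
        #pragma acc parallel loop collapse(2) copy(a[0:NELEM*NP]) copyin(b[0:NELEM*NP])
        for (int e = 0; e < NELEM; ++e)
            for (int p = 0; p < NP; ++p)
                a[e * NP + p] += 0.5 * b[e * NP + p];

        printf("a[0] = %g\n", a[0]);
        free(a);
        free(b);
        return 0;
    }

A single source file like this can be compiled unchanged for different targets (for instance, gcc ignores the directive unless -fopenacc is given), which is one common way the chapters above pursue performance portability across architectures.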