A parallel implementation on modern hardware for geo-electrical tomographical software


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyễn Hoàng Vũ

A PARALLEL IMPLEMENTATION ON MODERN HARDWARE FOR GEO-ELECTRICAL TOMOGRAPHICAL SOFTWARE

UNDERGRADUATE THESIS (FULL-TIME PROGRAM)
Major: Information Technology
Supervisor: Assoc. Prof. Dr.Sc. Phạm Huy Điển
Co-supervisor: Dr. Đoàn Văn Tuyến

HANOI – 2010

ABSTRACT

Geo-electrical tomographical software plays a crucial role in geophysical research. However, imported software is expensive and offers little of the customizability that more advanced geophysical study requires. In addition, these programs cannot exploit the full potential of modern hardware, so their running time is inadequate for large-scale geophysical surveys. Developing domestic software that overcomes these problems is therefore an essential task. The development of this software is based on our research into parallel programming on modern multi-core processors and stream processors for high performance computing. While the inter-disciplinary nature of this project poses many challenges, it has also given us valuable insights into building scientific software and, especially, into the new field of personal supercomputing.

INTRODUCTION
CHAPTER 1. HIGH PERFORMANCE COMPUTING ON MODERN HARDWARE
1.1 An overview of modern parallel architectures
1.1.1 Instruction-Level Parallel Architectures
1.1.2 Process-Level Parallel Architectures
1.1.3 Data parallel architectures
1.1.4 Future trends in hardware
1.2 Programming tools for scientific computing on personal desktop systems
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
1.2.2 GPU programming with CUDA
1.2.3 Heterogeneous programming and OpenCL
CHAPTER 2. THE FORWARD PROBLEM IN RESISTIVITY TOMOGRAPHY
2.1 Inversion theory
2.2 The geophysical model
2.3 The forward problem by differential method
CHAPTER 3. SOFTWARE IMPLEMENTATION
3.1 CPU implementation
3.2 Example Results
3.3 GPU Implementation using CUDA
CONCLUSION
REFERENCES

List of Acronyms

CPU      Central Processing Unit
CUDA     Compute Unified Device Architecture
GPU      Graphics Processing Unit
OpenMP   Open Multi-Processing
OpenCL   Open Computing Language
TBB      Intel Threading Building Blocks

INTRODUCTION

Geophysical methods are based on studying the propagation of different physical fields within the earth's interior. One of the most widely used fields in geophysics is the electromagnetic field generated by natural or artificial (controlled) sources. Electromagnetic methods comprise one of the three principal technologies in applied geophysics (the other two being seismic methods and potential field methods). Many geo-electromagnetic methods are currently in use around the world. Of these, resistivity tomography is the most widely used, and it is the main interest of our work.

Resistivity tomography [17], or resistivity imaging, is a method used in exploration geophysics [18] to measure underground physical properties in mineral, hydrocarbon, ground water or even archaeological exploration. It is closely related to the medical imaging technique called electrical impedance tomography (EIT), and mathematically it is the same inverse problem. In contrast to medical EIT, however, resistivity tomography is essentially a direct current method. This method is relatively new compared to other geophysical methods. Since the 1970s, extensive research has been done on the inversion theory for this method, and it is still an active research field today. A detailed historical description can be found in [27].

[Figure: Resistivity tomography surveys searching for oil and gas (left) or water (right)]

Resistivity tomography has the advantage of being relatively easy to carry out with inexpensive equipment, and it has therefore seen widespread use all over the world for many decades. With the increasing computing power of personal computers, inversion software for resistivity tomography has been developed, most notably Res2Dinv by Loke [5]. According to geophysicists at the Institute of Geology (Vietnam Academy of Science and Technology), the use of imported resistivity software has encountered the following serious problems:

• The user interface is not user-friendly;
• Some computation steps cannot be modified to adapt to the measurement methods used in Vietnam;
• With large datasets, the computational power of modern hardware is not fully exploited;
• The cost of purchasing and upgrading the software is high.

Resistivity software is a popular tool for both short-term and long-term projects in research, education and exploration by Vietnamese geophysicists. Replacing imported software is therefore essential, not only to reduce cost but also to enable more advanced research on the theoretical side, which requires custom software implementations.

The development of this software is based on research into using modern multi-core processors and stream processors for scientific software. This can also be the basis for solving larger geophysical problems on distributed systems if necessary. Our resistivity tomographical software is an example of applying high performance computing on modern hardware to computational geoscience.

For 2-D surveys with small datasets, sequential programs still provide results in acceptable time. Parallelizing these cases gives faster response times and therefore increases research productivity, but it is not a critical feature. For 3-D surveys, however, datasets are much larger and computation is much more expensive. One solution is to use clusters. Clusters, however, are not a feasible option for many scientific institutions in Vietnam. They are expensive and have high power consumption. Because they are available only in large institutions, getting access to them is also inconvenient. Clusters are also unsuitable for field trips because of difficulties in transportation and power supply. Exploiting the parallel capabilities of modern hardware is therefore a must to enable cost-effective scientific computing on desktop systems for such problems. This can reduce hardware cost and power consumption while increasing user convenience and software development productivity. These benefits are especially valuable to scientific software customers in Vietnam, where cluster deployment is costly in both money and human resources.

Chapter 1 High Performance Computing on Modern Hardware

1.1 An overview of modern parallel architectures

Computer speed is crucial for most software, especially scientific applications. As a result, computer designers have always looked for mechanisms to improve hardware performance. Processor speed and packaging densities have been enhanced greatly over the past decades.
However, due to the physical limitations of electronic components, other mechanisms have been introduced to improve hardware performance. According to [1], the objectives of architectural acceleration mechanisms are to

• decrease latency, the time from start to completion of an operation;
• increase bandwidth, the width and rate of operations.

Direct hardware implementations of expensive operations help reduce execution latency. Memory latency has been improved with larger register files, multiple register sets and caches, which exploit the spatial and temporal locality of reference in the program. For the bandwidth problem, the solutions can be classified into two forms of parallelism: pipelining and replication.

Pipelining [22] divides an operation into different stages so that these stages can execute concurrently for a stream of operations. If all of the stages of the pipeline are filled, a new result is available every unit of time it takes to complete the slowest stage. Pipelines are used in many kinds of processors. Figure 1 shows a generic pipeline with four stages. Without pipelining, four instructions take 16 clock cycles to complete. With pipelining, the last instruction leaves the final stage after only 7 clock cycles (4 cycles for the first instruction, plus one more for each of the remaining three).

Replication, on the other hand, duplicates hardware components to enable concurrent execution of different operations.

Pipelining and replication appear at different architectural levels and in various forms that complement each other. While numerous, these architectures can be divided into three groups [1]:

• Instruction-Level Parallel (Fine-Grained Control Parallelism)
• Process-Level Parallel (Coarse-Grained Control Parallelism)
• Data Parallel (Data Parallelism)

These categories are not mutually exclusive. A hardware device (such as the CPU) can belong to all three groups.

1.1.1 Instruction-Level Parallel Architectures

There are two common kinds of instruction-level parallel architecture.
The first is superscalar pipelined architectures, which subdivide the execution of each machine instruction into a number of stages. As short stages allow for high clock frequencies, the recent trend is to use longer pipelines. For example, the Pentium 4 uses a 20-stage pipeline and the latest Pentium 4 core contains a 31-stage pipeline.

Figure 1 Generic 4-stage pipeline; the colored boxes represent instructions independent of each other [21].

A common problem with these pipelines is branching. When a branch occurs, the processor has to wait until the branch is resolved before it can fetch the next instruction. A branch prediction unit is put into the CPU to guess which branch will be executed. However, if branches are predicted poorly, the performance penalty can be high. Some programming techniques for making branches in code more predictable for hardware can be found in [2]. Programming tools such as Intel VTune Performance Analyzer can be of great help in profiling programs for missed branch predictions.

The second kind of instruction-level parallel architecture is VLIW (very long instruction word) architectures. A very long instruction word usually controls 5 to 30 replicated execution units. An example of a VLIW architecture is the Intel Itanium processor [23]. As of 2009, Itanium processors can execute up to six instructions per cycle. Ordinary architectures use superscalar execution and out-of-order execution to speed up computing, which increases hardware complexity: the processor must decide at runtime whether instruction parts are independent so that they can be executed simultaneously. In VLIW architectures, this is decided at compile time, which shifts the complexity from hardware to software. All operations in one instruction must be independent, so efficient code generation is a hard task for compilers. The difficulty of writing compilers and porting legacy software to the new architecture has made the Itanium architecture unpopular.
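
The pipelining arithmetic given in Section 1.1 (four instructions on a four-stage pipeline) can be checked with a short calculation. The sketch below is illustrative only and assumes an idealized pipeline: every stage takes exactly one clock cycle and there are no stalls or mispredicted branches. The function names are hypothetical and do not come from the thesis. Under this model, n instructions on a k-stage pipeline leave the last stage after k + n − 1 cycles; a count that also includes the cycle in which the final instruction is shown as completed, as a diagram may do, comes out one higher.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    """Idealized pipeline: the first instruction needs num_stages cycles,
    and every further instruction finishes one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def pipeline_speedup(num_instructions: int, num_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts (assumes at least one instruction)."""
    return cycles_unpipelined(num_instructions, num_stages) / cycles_pipelined(
        num_instructions, num_stages
    )


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, cycles_unpipelined(n, 4), cycles_pipelined(n, 4), pipeline_speedup(n, 4))
```

For long instruction streams the speedup approaches the number of stages, which is why deep pipelines are attractive as long as branches do not force the pipeline to be refilled.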

1.1.2 Process-Level Parallel Architectures

Process-level parallel architectures exploit coarse-grained control parallelism in loops, functions or complete programs. They replicate complete, asynchronously executing processors to increase execution bandwidth and hence fit the multiple-instruction-multiple-data (MIMD) paradigm.

Until a few years ago, these architectures consisted of multiprocessors and multicomputers. A multiprocessor uses a shared memory address space for all processors. There are two kinds of multiprocessors:

• Symmetric Multiprocessor (SMP) computers: the cost of accessing an address in memory is the same for each processor. Furthermore, the processors are all equal in the eyes of the operating system.
• Non-Uniform Memory Access (NUMA) computers: the cost of accessing a given address in memory varies from one processor to another.

In a multicomputer, each processor has its own local memory. Access to remote memory requires explicit message passing over the interconnection network. Multicomputers are also called distributed memory architectures or message-passing architectures. An example is a cluster system. A cluster consists of many computing nodes, which can be built using high-performance hardware or commodity desktop hardware. All the nodes in a cluster are connected via InfiniBand or Gigabit Ethernet. Big clusters can have thousands of nodes with special interconnect topologies. Clusters are currently the only affordable way to do large-scale supercomputing at the level of hundreds of teraflops or more.

Figure 2 Example SMP system (left) and NUMA system (right)

A recent derivative of cluster computing is grid computing [19]. While traditional clusters often consist of similar nodes close to each other, grids incorporate heterogeneous collections of computers, possibly distributed geographically. They are therefore optimized for workloads containing many independent packets of work.
The two biggest grid computing networks are Folding@home and SETI@home (BOINC). Both have a computing capability of a few petaflops, while the most powerful traditional cluster can barely exceed 1 petaflops.

Figure 3 Intel CPU trends [12].

The most notable change to process-level parallel architectures happened in the last few years. Figure 3 shows that although the number of transistors in a CPU still increases according to Moore's law (commonly quoted as a doubling every 18 to 24 months), the clock speed has virtually stopped rising because of heat and manufacturing problems. CPU manufacturers have now turned to adding more cores to a single CPU while the clock speed stays the same or decreases. An individual core is a distinct processing element and is basically the same as the CPU in an older single-core PC. A multi-core chip can now be considered an SMP MIMD parallel processor. It can run at a lower clock speed and therefore consume less power while still delivering more processing power. The latest Intel Core i7-980X (Gulftown) CPU has 6 cores and 12 MB of cache. With hyper-threading it can support up to 12 hardware threads. Future multi-core CPU generations may have 8, 16 or even 32 cores in the next few years. These new architectures, especially in multi-processor nodes, can provide a level of parallelism that has previously been available only on cluster systems.

Figure 4 Intel Gulftown CPU.

1.1.3 Data parallel architectures

Data parallel architectures appeared very early in the history of computing. They utilize data parallelism to increase execution bandwidth. Data parallelism is common in many scientific and engineering tasks where a single operation is applied to a whole data set, usually a vector or a matrix. This allows applications to expose a large amount of independent parallel work. Both pipelining and replication have been applied to hardware to utilize data parallelism.

Pipelined vector processors, such as the Cray 1 [15], operate on vectors rather than scalars. After the instruction is decoded, vectors of data stream directly from memory into the pipelined functional units. Separate pipelines can be chained together to get higher performance. The translation of sequential code into vector instructions is called vectorization. Vectorizing compilers played a crucial role in programming for vector processors, and this significantly advanced the maturity of compilers in generating efficient parallel code.

Through replication, processor arrays can utilize data parallelism, as a single control unit can order a large number of simple processing elements to execute the same instruction on different data elements. These massively parallel supercomputers fit into the single-instruction-multiple-data (SIMD) paradigm.

Although both kinds of supercomputers mentioned above have virtually disappeared from common use, they are precursors of current data parallel architectures, most notably CPU SIMD processing and GPUs. The SIMD extension instruction sets for Intel CPUs include MMX, SSE, SSE2, SSE3, SSE4 and AVX. They allow the CPU to use a single instruction to operate on several data elements simultaneously. AVX, the latest extension instruction set, is expected to be implemented in both Intel and AMD products in 2010 and 2011. With AVX, the size of the SIMD vector register increases from 128 bits to 256 bits, which means the CPU can operate on 8 single-precision or 4 double-precision floating point numbers in one instruction. CPU SIMD processing has been used widely by programmers in many applications such as multimedia and encryption, and compiler code generation for these architectures is now quite good. Even now that multicore CPUs are popular, understanding SIMD extensions is still vital for optimizing program execution on each CPU core. A good handbook on software vectorization is [1].
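
The register-width arithmetic above (a 256-bit register holding 8 single-precision or 4 double-precision values) can be made concrete with a small sketch. The code below is only a model written for illustration: it emulates lane-wise addition in plain Python rather than issuing real SSE or AVX instructions, and all function names are hypothetical. Each loop iteration in `simd_add` stands for one vector instruction.

```python
import math


def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register size")
    return register_bits // element_bits


def vector_instruction_count(num_elements: int, lanes: int) -> int:
    """Vector instructions needed to process num_elements, one register at a time."""
    return math.ceil(num_elements / lanes)


def simd_add(a, b, lanes):
    """Emulate lane-wise addition: each iteration handles up to `lanes` elements,
    standing in for a single vector add instruction."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    for start in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[start:start + lanes], b[start:start + lanes]))
    return out


if __name__ == "__main__":
    avx_float = simd_lanes(256, 32)   # single precision in a 256-bit register
    avx_double = simd_lanes(256, 64)  # double precision in a 256-bit register
    print(avx_float, avx_double)
    print(vector_instruction_count(1000, avx_float))
    print(simd_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], avx_float))
```

In this model, widening the register from 128 to 256 bits halves the number of vector instructions for the same array, which is the source of the expected speedup when code is vectorized well.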

However, graphics processing units (GPUs) are perhaps the hardware with the most dramatic growth in processing power over the last few years. Graphics chips started as fixed-function graphics pipelines. Over the years, these chips became increasingly programmable through newer graphics APIs and shaders. In the 1999-2000 timeframe, computer scientists in particular, along with researchers in fields such as medical imaging and electromagnetics, started using GPUs to run general purpose computational applications. They found that the excellent floating point performance of GPUs led to a huge performance boost for a range of scientific applications. This was the advent of the movement called GPGPU, or General Purpose computing on GPUs.

With the advent of programming platforms such as CUDA and OpenCL, GPUs are now easier to program. With processing power of a few teraflops, GPUs are now massively parallel processors on a much smaller scale than the earlier supercomputers. They are also termed stream processors, as data is streamed directly from memory into the execution units without the latency seen on CPUs. As can be seen in Figure 5, GPUs currently outpace CPUs several times over in both speed and bandwidth.

Figure 5 Comparison between CPU and GPU speed and bandwidth (CUDA Programming Guide) [8].

The two most notable GPU architectures now are the ATI Radeon 5870 (Cypress) and the Nvidia GF100 (Fermi). The Radeon 5870 processor has 20 SIMD engines, each of which contains 16 thread processors. Each of those thread processors has five arithmetic logic units, or ALUs. With a total of 1600 stream processors and a clock speed of 850 MHz, the Radeon 5870 has a single-precision computing power of 2.72 Tflops, while the processing power of a top-of-the-line CPU is still counted in Gflops. Double-precision computing runs at one fifth of the single-precision rate, at 544 Gflops. This card supports both OpenCL and DirectCompute.
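
The Radeon 5870 figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is a worked check, not a benchmark. It assumes that each ALU can issue one multiply-add (counted as 2 floating point operations) per clock cycle; that factor is not stated in the text and is introduced here as an assumption, as are the function and variable names.

```python
def stream_processor_count(simd_engines: int, thread_processors: int, alus: int) -> int:
    """Total ALUs (stream processors) on the chip."""
    return simd_engines * thread_processors * alus


def peak_gflops(stream_processors: int, clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in Gflops.

    flops_per_cycle=2 is an assumption: one multiply-add per ALU per cycle.
    """
    return stream_processors * clock_mhz * flops_per_cycle / 1000.0


if __name__ == "__main__":
    alus = stream_processor_count(simd_engines=20, thread_processors=16, alus=5)
    single = peak_gflops(alus, clock_mhz=850)
    double = single / 5  # double precision at one fifth of the single-precision rate
    print(alus, single, double)
```

Under these assumptions the calculation gives 1600 stream processors, 2720 Gflops in single precision and 544 Gflops in double precision, matching the quoted values. Theoretical peaks of this kind are upper bounds; sustained performance on real workloads is typically lower.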
The dual-GPU version, the Radeon 5970 (Hemlock), has a single-precision computing power of 4.7 Tflops in a graphics card with a thermal envelope of less than 300 W. Custom overclocked versions made by graphics card manufacturers can offer even more computing power than the original version.

Figure 6 ATI Radeon 5870 (Cypress) graphics processor

The Nvidia GF100 processor has 3 billion transistors and 15 SM (Streaming Multiprocessor) units, each with 32 shader cores (CUDA processors), compared with 8 in previous Nvidia GPUs. Each CUDA processor has a fully pipelined integer arithmetic logic unit and a floating point unit with better standards conformance and a fused multiply-add instruction for both single and double precision. The integer precision was raised from 24 bits to 32 bits, so multi-instruction emulation is no longer required. Special function units in each SM can execute transcendental instructions such as sine, cosine, reciprocal and square root.

Figure 7 Nvidia GF100 (Fermi) processor with parallel kernel execution

Single-precision performance of the GF100 is about 1.7 Tflops. Double-precision performance is about half of that, at roughly 800 Gflops, which is significantly better than the double-precision rate of the Radeon 5870. Previous architectures required all SMs in the chip to work on the same kernel (function/program/loop) at the same time. In this generation, the GigaThread scheduler can execute threads from multiple kernels in parallel. This chip is specifically designed to provide better support for GPGPU, with memory error correction, native support for C++ (including virtual functions, function pointers, dynamic memory management using new and delete, and exception handling), and compatibility with CUDA, OpenCL and DirectCompute. A true two-level cache hierarchy has been added, along with more shared memory than in previous GPU generations. Context switching and atomic operations are also faster. Fortran compilers are also available from PGI.
Versions intended for scientific computing will have from 3 GB to 6 GB of GDDR5 memory.

1.1.4 Future trends in hardware

Although the current parallel architectures are very powerful, especially for parallel workloads, they will not stay the same in the future. From the current situation, we can identify some trends for hardware in the next few years.

The first is the change in the composition of clusters. A cluster node can now have several multicore processors and some graphics processors. Consequently, clusters with fewer nodes can still have the same processing power. This also raises the upper limit of cluster processing capability. Traditional clusters consisting only of CPU nodes have virtually reached their peak at about 1 Pflops. Adding more nodes would result in more system overhead with only a marginal increase in speed. Electricity consumption is also enormous for such systems. Supercomputing now accounts for 2 percent of the total electricity consumption of the entire United States. Building supercomputers at the exascale (1000 Pflops) using traditional clusters is far too costly. Graphics processors or similar architectures provide a good Gflops/W ratio and are therefore vital to building supercomputers with larger processing power. The IBM Roadrunner supercomputer [21], which uses Cell processors, is a clear example of this trend.

The second trend is the convergence of stream processors and CPUs. Graphics cards currently play the role of co-processors to the CPU in floating point intensive tasks. In the long term, all the functionality of the graphics card may reside on the CPU, just as happened with math co-processors, which are now CPU floating point units. The Cell processor by Sony, Toshiba and IBM is heading in that direction. AMD has also been continuously pursuing this with its Fusion project. The Nvidia GF100 is a GPU with many CPU features such as memory error correction and large caches.
Intel's experimental Larrabee project went even further by aiming to produce an x86-compatible GPU that would later be integrated into Intel CPUs. These would all lead to a new kind of processor called the Accelerated Processing Unit (APU).

The third trend is the evolution of multicore CPUs into many-core processors in which the individual cores form a cluster system. In December 2009, Intel unveiled the newest product of its Terascale Computing Research program, a 48-core x86 processor.

Figure 8 The Intel 48-core processor. To the right is a dual-core tile. The processor has 24 such tiles in a 6 by 4 layout.

It is the sequel to Intel's 2007 Polaris 80-core prototype, which was based on simple floating point units. This device is called a "Single-chip Cloud Computer" (SCC). The structure of the chip resembles that of a cluster, with cores connected through a message-passing network with 256 GB/s bandwidth. Shared memory is simulated in software. Cache coherence and power management are also software-based. Each core can run its own OS and software, which resembles a cloud computing center. Each tile (2 cores) can have its own frequency, and groupings of four tiles (8 cores) can each run at their own voltage. The SCC can run all 48 cores at one time over a range of 25 W to 125 W and can selectively vary the voltage and frequency of the mesh network as well as of sets of cores. This 48-core device consists of 1.3 billion transistors produced using a 45 nm high-k metal gate process. Intel is currently distributing these processors to its partners in both industry and academia to encourage further research in parallel computing.

Tilera Corporation is also producing processors with one hundred cores. Each core can run a Linux OS independently. The processor also has Dynamic Distributed Cache technology, which provides a fully coherent shared cache system across an arbitrarily sized array of tiles.
Programming can be done normally on a Linux derivative with full support for C and C++ and the Tilera parallel libraries. The processor uses VLIW (Very Long Instruction Word) with RISC instructions for each core. The primary focus of this processor is networking, multimedia and cloud computing, with a strong emphasis on integer computation to complement the floating point computation of GPUs.

From all these trends, it is reasonable to assume that in the near future we will see new architectures that combine features of all current architectures, such as many-core processors in which each core has a CPU core with stream processors as coprocessors. Such systems would provide tremendous computing power per processor and would cause major changes in the field of computing.

1.2 Programming tools for scientific computing on personal desktop systems

Traditionally, most scientific computing tasks have been done on clusters. However, with the advent of modern hardware that provides a high level of parallelism, many small to medium-sized tasks can now be run on a single high-end desktop computer in reasonable time. Such systems are called "personal supercomputers". Although their configurations vary, most today employ multicore CPUs with multiple GPUs. An example is the Fastra II desktop supercomputer [3] at the University of Antwerp, Belgium, which can achieve 12 Tflops of computing power. The Fastra II contains six NVIDIA GTX295 dual-GPU cards and one GTX275 single-GPU card, with a total cost of less than six thousand euros. The real processing speed of this system can equal that of a cluster with thousands of CPU cores.

Although these systems are more cost-effective, consume less power and provide greater convenience for their users, they pose serious problems for software developers.
