## Lecture 13 (part 1) Thread Level Parallelism (6) EEC 171 Parallel Architectures John Owens UC Davis

## Credits

- © John Owens / UC Davis 2007–8.
- Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Kathy Yelick / UCB 2007, © Wen-Mei Hwu/David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–7, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanovic/Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.

## Outline

- Interconnection Networks
- Grab bag:
  - Amdahl's Law
  - Novices & Parallel Programming
  - Interconnect Technologies

#### **Preliminaries and Evolution**

- One switch suffices to connect a small number of devices
  - Number of switch ports limited by VLSI technology, power consumption, packaging, and other such <u>cost</u> constraints
- A *fabric* of interconnected switches (i.e., *switch fabric* or *network fabric*) is needed when the number of devices is much larger
  - The topology must make a path(s) available for every pair of devices—property of connectedness or full access (What paths?)
- Topology defines the connection structure across all components
  - Bisection bandwidth: the minimum bandwidth of all links crossing a network split into two roughly equal halves
  - Full bisection bandwidth:
    - > Network  $BW_{Bisection} =$  Injection (or Reception)  $BW_{Bisection} = N/2$
  - Bisection bandwidth mainly affects performance
- Topology is constrained primarily by local chip/board pin-outs; secondarily, (if at all) by global bisection bandwidth

#### **Centralized Switched (Indirect) Networks**

- Crossbar network
  - Crosspoint switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as  $O(N^2)$
  - Has the property of being non-blocking



#### **Centralized Switched (Indirect) Networks**

- Multistage interconnection networks (MINs)
  - Crossbar split into several stages consisting of smaller crossbars
  - Complexity grows as  $O(N \times \log N)$ , where N is # of end nodes
  - Inter-stage connections represented by a set of permutation functions
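
To make the cost difference concrete, the sketch below counts crosspoints for a single N x N crossbar and for a MIN built from k x k crossbars. It assumes N is an exact power of k and that the MIN has $\log_k N$ stages of N/k switches each; the function names are illustrative, not from the slides.

```python
def crossbar_crosspoints(n_ports: int) -> int:
    """Crosspoints in a single N x N crossbar: one per input/output pair."""
    return n_ports * n_ports


def min_crosspoints(n_ports: int, k: int = 2) -> int:
    """Crosspoints in a MIN built from k x k crossbars.

    Assumes n_ports is an exact power of k, giving log_k(N) stages
    of N/k switches, each switch holding k*k crosspoints.
    """
    stages, size = 0, 1
    while size < n_ports:
        size *= k
        stages += 1
    if size != n_ports:
        raise ValueError("n_ports must be a power of k")
    return stages * (n_ports // k) * k * k


if __name__ == "__main__":
    for n in (16, 64, 1024):
        print(n, crossbar_crosspoints(n), min_crosspoints(n))
```

For 16 ports the crossbar needs 256 crosspoints against 128 for the MIN of 2 x 2 switches, and the gap widens quickly as N grows.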



Omega topology, perfect-shuffle exchange

#### Centralized Switched (Indirect) Networks



16 port, 4 stage Butterfly network

#### **Centralized Switched (Indirect) Networks**

- Reduction in MIN switch <u>cost</u> comes at the price of <u>performance</u>
  - Network has the property of being blocking
  - Contention is more likely to occur on network links
    - Paths from different sources to different destinations share one or more links
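
Blocking can be seen in a small example. The sketch below assumes an 8-port Omega network of 2 x 2 switches with destination-tag routing (a perfect shuffle before each stage, then one destination bit, most significant first, selects the switch output). The helper names are hypothetical.

```python
def omega_path(src: int, dst: int, n_bits: int) -> list[tuple[int, int]]:
    """Links used from src to dst in a 2**n_bits-port Omega network.

    Each stage applies a perfect shuffle (rotate the address left by one
    bit), then the 2 x 2 switch picks its upper or lower output from the
    next destination bit, most significant bit first.
    Returns (stage, output_position) pairs, one per stage.
    """
    mask = (1 << n_bits) - 1
    pos, links = src, []
    for stage in range(n_bits):
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask  # perfect shuffle
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit                          # switch setting
        links.append((stage, pos))
    return links


def shared_links(a, b):
    """Links that two paths both need (non-empty means contention)."""
    return sorted(set(a) & set(b))


if __name__ == "__main__":
    p1 = omega_path(0, 0, 3)
    p2 = omega_path(4, 1, 3)
    p3 = omega_path(1, 7, 3)
    print(shared_links(p1, p2))  # non-empty: these two paths conflict
    print(shared_links(p1, p3))  # empty: these two can proceed together
```

Sources 0 and 4 differ, and so do destinations 0 and 1, yet both paths need the same output of the first-stage switch, which is exactly the link sharing described above.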



non-blocking topology



blocking topology

#### **Centralized Switched (Indirect) Networks**

- How to reduce blocking in MINs? <u>Provide alternative paths!</u>
  - Use larger switches (can equate to using more switches)
    - Clos network: minimally three stages (non-blocking)
      - » A larger switch in the middle of two other switch stages provides enough alternative paths to avoid all conflicts
  - Use more switches
    - > Add  $\log_k N - 1$  stages, mirroring the original topology
      - » Rearrangeably non-blocking
      - » Allows for non-conflicting paths
      - » Doubles network hop count (distance), d
      - » Centralized control can rearrange established paths

Benes topology: 2(log<sub>2</sub>N) - 1 stages (rearrangeably non-blocking)

» Recursively applies the three-stage Clos network concept to the middle-stage set of switches to reduce all switches to 2 x 2
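
As a quick check of the stage counts, take $N = 16$ end nodes and 2 x 2 switches ($k = 2$), the configuration used in the 16-port figures:

$$\log_2 16 = 4 \ \text{stages (butterfly)}, \qquad 4 + (\log_2 16 - 1) = 2\log_2 16 - 1 = 7 \ \text{stages (Benes)}$$

The three extra mirrored stages are what buy the alternative paths, at the price of a hop count that grows from 4 to 7.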

#### **Centralized Switched (Indirect) Networks**



Interconnection Networks: © Timothy Mark Pinkston and José Duato, with major presentation contribution from José Flich

#### **Centralized Switched (Indirect) Networks**



Alternative paths from 0 to 1. 16 port, 7 stage Clos network = Benes topology


#### **Centralized Switched (Indirect) Networks**



Alternative paths from 4 to 0. 16 port, 7 stage Clos network = Benes topology

#### Myrinet-2000 Clos Network for 128 Hosts







 Backplane of the M3-E128 Switch
 M3-SW16-8F fiber line card (8 ports)

http://myri.com

#### **Distributed Switched (Direct) Networks**

- Bidirectional Ring networks
  - N switches (3 × 3) and N bidirectional network links
  - Simultaneous packet transport over disjoint paths
  - Packets must hop across intermediate nodes
  - Shortest direction usually selected (N/4 hops, on average)
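
The N/4 figure can be checked by brute force. This sketch assumes uniformly distributed destinations and always takes the shorter direction; whether the source itself counts as a destination is an assumption exposed as a parameter.

```python
def ring_average_hops(n_nodes: int, include_self: bool = True) -> float:
    """Average shortest-path hop count from one node of a bidirectional ring.

    The shorter of the two directions is taken for every destination.
    include_self=True averages over all N destinations (distance 0 to
    itself); False averages over the other N - 1 nodes only.
    """
    total = sum(min(d, n_nodes - d) for d in range(n_nodes))
    count = n_nodes if include_self else n_nodes - 1
    return total / count


if __name__ == "__main__":
    for n in (16, 64):
        print(n, ring_average_hops(n), ring_average_hops(n, include_self=False))
```

For even N the average over all N destinations is exactly N/4; excluding the source gives a slightly larger value.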



#### Distributed Switched (Direct) Networks:

- Fully connected and ring topologies delimit the two extremes
- The ideal topology:
  - Cost approaching a ring
  - Performance approaching a fully connected (crossbar) topology
- More practical topologies:
  - k-ary n-cubes (meshes, tori, hypercubes)
    - > k nodes connected in each dimension, with n total dimensions
    - > Symmetry and regularity
      - » network implementation is simplified
      - » routing is simplified


#### **Distributed Switched (Direct) Networks**



#### **Topological Characteristics of Commercial Machines**

| Company | System [Network] Name | Max. number of nodes [x # CPUs] | Basic network topology | Injection [Reception] node BW in MBytes/s | # of data bits per link per direction | Raw network link BW per direction in MBytes/s | Raw network bisection BW (bidir) in GBytes/s |
|---|---|---|---|---|---|---|---|
| Intel | ASCI Red Paragon | 4,510 [x 2] | 2-D mesh 64 x 64 | 400 [400] | 16 bits | 400 | 51.2 |
| IBM | ASCI White SP Power3 [Colony] | 512 [x 16] | BMIN w/ 8-port bidirectional switches (fat-tree or Omega) | 500 [500] | 8 bits (+1 bit of control) | 500 | 256 |
| Intel | Thunder Itanium2 Tiger4 [QsNet<sup>II</sup>] | 1,024 [x 4] | fat tree w/ 8-port bidirectional switches | 928 [928] | 8 bits (+2 control for 4b/5b enc) | 1,333 | 1,365 |
| Cray | XT3 [SeaStar] | 30,508 [x 1] | 3-D torus 40 x 32 x 24 | 3,200 [3,200] | 12 bits | 3,800 | 5,836.8 |
| Cray | X1E | 1,024 [x 1] | 4-way bristled 2-D torus (~ 23 x 11) with express links | 1,600 [1,600] | 16 bits | 1,600 | 51.2 |
| IBM | ASC Purple pSeries 575 [Federation] | >1,280 [x 8] | BMIN w/ 8-port bidirectional switches (fat-tree or Omega) | 2,000 [2,000] | 8 bits (+2 bits of control) | 2,000 | 2,560 |
| IBM | Blue Gene/L eServer Sol. [Torus Net] | 65,536 [x 2] | 3-D torus 32 x 32 x 64 | 612.5 [1,050] | 1 bit (bit serial) | 175 | 358.4 |


## Routing, Arbitration, and Switching

#### Routing

- Performed at each switch, regardless of topology
- Defines the "allowed" path(s) for each packet (Which paths?)
- Needed to <u>direct packets through network</u> to intended destinations
- Ideally:
  - Supply as many routing options to packets as there are paths provided by the topology, and evenly distribute network traffic among network links using those paths, minimizing contention
- Problems: situations that cause packets never to reach their destination
  - Livelock
    - Arises from an unbounded number of allowed non-minimal hops
    - Solution: restrict the number of non-minimal (mis)hops allowed
  - Deadlock
    - Arises from a set of packets being blocked waiting only for network resources (i.e., links, buffers) held by other packets in the set
    - Probability increases with increased traffic and decreased availability
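
The deadlock condition amounts to a cycle in a "waits-for" relation among packets. The sketch below is a deliberately simplified model, assuming each blocked packet waits on exactly one resource held by exactly one other packet; the packet names and the `find_deadlock` helper are hypothetical.

```python
def find_deadlock(waits_for: dict[str, str]) -> list[str]:
    """Return one cycle of mutually blocked packets, or [] if none.

    waits_for maps a blocked packet to the packet that currently holds
    the channel or buffer it needs. Packets that are not blocked do not
    appear as keys.
    """
    for start in waits_for:
        seen = []
        current = start
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current in seen:
            # Everything from the first repeated packet onward is the cycle.
            return seen[seen.index(current):]
    return []


if __name__ == "__main__":
    print(find_deadlock({"p1": "p2", "p2": "p3", "p3": "p1"}))  # a cycle
    print(find_deadlock({"p1": "p2", "p2": "p3"}))              # no cycle
```

In the second case `p3` is not blocked, so it eventually releases its resources and the chain drains; in the first case no packet in the set can ever advance.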

## Routing, Arbitration, and Switching

#### Routing

- Common forms of deadlock:
  - Routing-induced deadlock

$c_i$ = channel $i$, $s_i$ = source node $i$, $d_i$ = destination node $i$, $p_i$ = packet $i$



"A Formal Model of Message Blocking and Deadlock Resolution in Interconnection Networks," S. Warnakulasuriya and T. Pinkston, IEEE Trans. on Parallel and Distributed Systems, Vol. 11, No. 3, pp. 212–229, March, 2000.

#### **On-Chip Networks (OCNs)**

| Institution & Processor [Network] name | Year built | Number of network ports [cores or tiles + other ports] | Basic network topology | # of data bits per link per direction | Link bandwidth [link clock speed] | Routing; Arbitration; Switching | # of chip metal layers; flow control; # VCs |
|---|---|---|---|---|---|---|---|
| MIT Raw [General Dynamic Network] | 2002 | 16 ports [16 tiles] | 2-D mesh 4 x 4 | 32 bits | 0.9 GBps [225 MHz, clocked at proc speed] | XY DOR w/ request-reply deadlock recovery; RR arbitration; wormhole | 6 layers; credit-based; no VCs |
| IBM POWER5 | 2004 | 7 ports [2 PE cores + 5 other ports] | Crossbar | 256 b Inst fetch; 64 b for stores; 256 b LDs | [1.9 GHz, clocked at proc speed] | Shortest-path; non-blocking; circuit switch | 7 layers; handshaking; no virtual channels |
| U.T. Austin TRIPS EDGE [Operand Network] | 2005 | 25 ports [25 execution unit tiles] | 2-D mesh 5 x 5 | 110 bits | 5.86 GBps [533 MHz clk scaled by 80%] | XY DOR; distributed RR arbitration; wormhole | 7 layers; on/off flow control; no VCs |
| U.T. Austin TRIPS EDGE [On-Chip Network] | 2005 | 40 ports [16 L2 tiles + 24 network interface tiles] | 2-D mesh 10 x 4 | 128 bits | 6.8 GBps [533 MHz clk scaled by 80%] | XY DOR; distributed RR arbitration; VCT switched | 7 layers; credit-based flow control; 4 VCs |
| Sony, IBM, Toshiba Cell BE [Element Interconnect Bus] | 2005 | 12 ports [1 PPE and 8 SPEs + 3 other ports for memory, I/O interface] | Ring 4 total, 2 in each direction | 128 bits data (+16 bits tag) | 25.6 GBps [1.6 GHz, clocked at half the proc speed] | Shortest-path; tree-based RR arb. (centralized); pipelined circuit switch | 8 layers; credit-based flow control; no VCs |
| Sun UltraSPARC T1 processor | 2005 | Up to 13 ports [8 PE cores + 4 L2 banks + 1 shared I/O] | Crossbar | 128 b both for the 8 cores and the 4 L2 banks | 19.2 GBps [1.2 GHz, clocked at proc speed] | Shortest-path; age-based arbitration; VCT switched | 9 layers; handshaking; no VCs |
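
The link bandwidth column follows from the data width and link clock columns. The sketch below recomputes it for the rows that list both, assuming peak bandwidth is simply bytes per cycle times clock rate; the helper name is illustrative.

```python
def link_bandwidth_gbytes(data_bits: int, clock_ghz: float) -> float:
    """Peak link bandwidth in GB/s: bytes per cycle times cycles per ns."""
    return data_bits / 8 * clock_ghz


if __name__ == "__main__":
    rows = [
        ("MIT Raw", 32, 0.225),
        ("TRIPS Operand Network", 110, 0.533 * 0.8),  # clock scaled by 80%
        ("TRIPS On-Chip Network", 128, 0.533 * 0.8),
        ("Cell BE EIB", 128, 1.6),
        ("UltraSPARC T1", 128, 1.2),
    ]
    for name, bits, ghz in rows:
        print(f"{name}: {link_bandwidth_gbytes(bits, ghz):.2f} GB/s")
```

The results line up with the table entries (0.9, 5.86, 6.8, 25.6, and 19.2 GBps) to the precision shown there.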


#### **Cell Broadband Engine Element Interconnect Bus**

- Cell BE is successor to PlayStation 2's Emotion Engine
  - 300 MHz MIPS-based
  - Uses two vector elements
  - 6.2 GFLOPS (Single Precision)
  - 72KB Cache + 16KB Scratch Pad RAM
  - 240mm<sup>2</sup> on 0.25-micron process
- PlayStation 3 uses the Cell BE\*
  - 3.2 GHz POWER-based
  - Eight SIMD (Vector) Processor Elements
  - >200 GFLOPS (Single Precision)
  - 544KB cache + 2MB Local Store RAM
  - 235mm<sup>2</sup> on 90-nanometer SOI process

\*Sony has decided to use only 7 SPEs for the PlayStation 3 to improve yield. Eight SPEs will be assumed for the purposes of this discussion.

#### Cell Broadband Engine Element Interconnect Bus

- Cell Broadband Engine (Cell BE): 200 GFLOPS
  - 12 Elements (devices) interconnected by EIB:
    - One 64-bit Power processor element (PPE) with aggregate bandwidth of 51.2 GB/s
    - Eight 128-bit SIMD synergistic processor elements (SPE) with local store, each with a bandwidth of 51.2 GB/s
    - One memory interface controller (MIC) element with memory bandwidth of 25.6 GB/s
    - Two configurable I/O interface elements: 35 GB/s (out) and 25GB/s (in) of I/O bandwidth
  - Element Interconnect Bus (EIB):
    - Four unidirectional <u>rings</u> (two in each direction) each connect the heterogeneous 12 elements (end node devices)
    - > Data links: 128 bits wide @ 1.6 GHz; data bandwidth: 25.6 GB/s
    - > Provides coherent and non-coherent data transfer
    - Should optimize network traffic flow (throughput) and utilization while minimizing network latency and overhead




#### **Cell Broadband Engine Element Interconnect Bus**

- Element Interconnect Bus (EIB)
  - Packet size: 16 B to 128 B (no headers); pipelined circuit switching
  - Credit-based flow control (command bus central token manager)
  - Two-stage, dual round-robin centralized network arbiter
  - Allows up to 64 outstanding requests (DMA)
    - > 64 Request Buffers in the MIC; 16 Request Buffers per SPE
  - Latency: 1 cycle/hop, transmission time (largest packet) 8 cycles
  - Effective bandwidth: peak 307.2 GB/s, max. sustainable 204.8 GB/s
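
Two of these figures can be reproduced from the 128-bit (16 B) data links. The peak value below is one reading that is consistent with the numbers, assuming all 12 elements transfer at the full 25.6 GB/s link rate at once; the slide itself does not spell out the derivation, and the sustainable figure is taken as given.

$$\frac{128\ \text{B}}{16\ \text{B/cycle}} = 8\ \text{cycles}, \qquad 12 \times 25.6\ \text{GB/s} = 307.2\ \text{GB/s}$$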



#### Blue Gene/L 3D Torus Network

- 360 TFLOPS (peak)
- 2,500 square feet
- Connects 65,536 dual-processor nodes and 1,024 I/O nodes
  - One processor for computation; other meant for communication



#### Blue Gene/L 3D Torus Network



#### Blue Gene/L 3D Torus Network

- Main network: 32 x 32 x 64 3-D torus
  - Each node connects to six other nodes
  - Full routing in hardware
- Links and Bandwidth
  - 12 bit-serial links per node (6 in, 6 out)
  - Torus clock speed runs at 1/4th of processor rate
  - Each link is 1.4 Gb/s at target 700-MHz clock rate (175 MB/s)
  - High internal switch connectivity to keep all links busy
    - > External switch input links: 6 at 175 MB/s each (1,050 MB/s aggregate)
    - > External switch output links: 6 at 175 MB/s each (1,050 MB/s aggregate)
    - > Internal datapath crossbar input links: 12 at 175 MB/s each
    - > Internal datapath crossbar output links: 6 at 175 MB/s each
    - > Switch injection links: 7 at 175 MB/s each (2 cores, each with 4 FIFOs)
    - > Switch reception links: 12 at 175 MB/s each (2 cores, each with 7 FIFOs)
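
The headline numbers for the torus can be recomputed from the figures above. This sketch assumes uniform random traffic for the hop estimate, with each dimension of size k contributing about k/4 hops (the bidirectional-ring average); the variable names are illustrative.

```python
DIMS = (32, 32, 64)        # torus size from the slide
LINK_MBITS_PER_S = 1400    # 1.4 Gb/s per bit-serial link, from the slide
LINKS_PER_DIRECTION = 6    # 6 in, 6 out per node

nodes = DIMS[0] * DIMS[1] * DIMS[2]
link_mbytes_per_s = LINK_MBITS_PER_S / 8
aggregate_mbytes_per_s = LINKS_PER_DIRECTION * link_mbytes_per_s

# Assumption: uniform random traffic, so each dimension of size k
# contributes about k/4 hops (the bidirectional-ring average).
average_hops = sum(k / 4 for k in DIMS)

print(nodes, link_mbytes_per_s, aggregate_mbytes_per_s, average_hops)
```

The node count and per-link and aggregate rates match the slide (65,536 nodes, 175 MB/s, 1,050 MB/s); the hop estimate of 32 is a back-of-the-envelope figure under the stated traffic assumption.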



## Encountering Amdahl's Law

- Speedup due to enhancement E is

$$\text{Speedup w/ E} = \frac{\text{Exec time w/o E}}{\text{Exec time w/ E}}$$

- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

$$\text{ExTime w/ E} = \text{ExTime w/o E} \times \left((1 - F) + \frac{F}{S}\right)$$

$$\text{Speedup w/ E} = \frac{1}{(1 - F) + F/S}$$
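
One consequence worth writing out: however large S becomes, the unaffected fraction bounds the speedup.

$$\lim_{S \to \infty} \frac{1}{(1 - F) + F/S} = \frac{1}{1 - F}$$

For instance, with F = 0.9 the speedup can never exceed 10, no matter how fast the enhanced part runs.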

## Challenges of Parallel Processing

- Limited application parallelism  $\Rightarrow$  addressed primarily via new algorithms that have better parallel performance
- Long remote latency impact ⇒ addressed both by the architect and by the programmer
- For example, reduce the frequency of remote accesses either by
  - Caching shared data (HW)
  - Restructuring the data layout to make more accesses local (SW)

## Examples: Amdahl's Law

Speedup w/ E = 1 / ((1-F) + F/S)

- Consider an enhancement that runs 20 times faster but is usable only 25% of the time.
  - Speedup w/ E = 1 / (0.75 + 0.25/20) ≈ 1.31
- What if it's usable only 15% of the time?
  - Speedup w/ E = 1 / (0.85 + 0.15/20) ≈ 1.17
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less
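
The arithmetic behind these examples can be checked with a few lines. The sketch below uses the slide's formula directly; `max_scalar_fraction` is a hypothetical helper that inverts it, treating the parallel part as sped up by a factor equal to the processor count.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Speedup when a fraction f of the task is accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)


def max_scalar_fraction(target_speedup: float, processors: int) -> float:
    """Largest scalar (unenhanced) fraction that still reaches the target.

    Solves 1 / (x + (1 - x) / P) = target for x.
    """
    p = processors
    return (p / target_speedup - 1.0) / (p - 1.0)


if __name__ == "__main__":
    print(f"{amdahl_speedup(0.25, 20):.2f}")           # usable 25% of the time
    print(f"{amdahl_speedup(0.15, 20):.2f}")           # usable 15% of the time
    print(f"{max_scalar_fraction(99, 100) * 100:.4f}%")  # scalar budget
```

The scalar budget for a speedup of 99 on 100 processors comes out to 1/9801 of the program, roughly 0.01%.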

## Challenges of Parallel Processing

- Second challenge is long latency to remote memory
- Suppose a 32-CPU multiprocessor at 2 GHz with 200 ns remote memory latency; all local accesses hit in the memory hierarchy and base CPI is 0.5. (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles.)
- What is the performance impact if 0.2% of instructions involve a remote access?
  - 1.5X
  - 2.0X
  - 2.5X
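
One way to work the question is to fold the remote-access penalty into the CPI. The sketch below assumes each remote access stalls the processor for the full 400 cycles and that nothing else changes; the variable names are illustrative.

```python
CLOCK_GHZ = 2.0
REMOTE_LATENCY_NS = 200.0
BASE_CPI = 0.5
REMOTE_FRACTION = 0.002  # 0.2% of instructions

remote_cycles = REMOTE_LATENCY_NS * CLOCK_GHZ               # cycles per remote access
effective_cpi = BASE_CPI + REMOTE_FRACTION * remote_cycles  # base plus stall cycles
slowdown = effective_cpi / BASE_CPI

print(remote_cycles, effective_cpi, slowdown)
```

Under these assumptions the effective CPI is about 1.3, so the all-local machine is about 2.6 times faster, which is closest to the 2.5X option.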

## How hard is parallel programming anyway?

- Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers
  - Lorin Hochstein, Jeff Carver, Forrest Shull, Sima Asgari, Victor Basili, Jeffrey K. Hollingsworth, Marvin V. Zelkowitz
  - Supercomputing 2005

## Why use students for testing?

- First, multiple students are routinely given the same assignment to perform, and thus we are able to conduct experiments in a way to control for the skills of specific programmers.
- Second, graduate students in a HPC class are fairly typical of a large class of novice HPC programmers who may have years of experience in their application domain but very little in HPC-style programming.
- Finally, due to the relatively low costs, student studies are an excellent environment to debug protocols that might be later used on practicing HPC programmers.

## Tests run

|  | Serial | MPI | OpenMP | Co-Array Fortran | StarP | XMT |
|---|---|---|---|---|---|---|
| **Nearest-Neighbor Type Problems** |  |  |  |  |  |  |
| Game of Life | C3A3 | C3A3<br>C0A1<br>C1A1 | C3A3 |  |  |  |
| Grid of Resistors | C2A2 | C2A2 | C2A2 |  | C2A2 |  |
| Sharks & Fishes |  | C6A2 | C6A2 | C6A2 |  |  |
| Laplace's Eq. |  | C2A3 |  |  | P2A3 |  |
| SWIM |  |  | C0A2 |  |  |  |
| **Broadcast Type Problems** |  |  |  |  |  |  |
| LU Decomposition |  |  | C4A1 |  |  |  |
| Parallel Mat-vec |  |  |  |  | C3A4 |  |
| Quantum Dynamics |  | C7A1 |  |  |  |  |
| **Embarrassingly Parallel Type Problems** |  |  |  |  |  |  |
| Buffon-Laplace Needle |  | C2A1<br>C3A1 | C2A1<br>C3A1 |  | C2A1<br>C3A1 |  |
| **(Miscellaneous Problem Types)** |  |  |  |  |  |  |
| Parallel Sorting |  | C3A2 | C3A2 |  | C3A2 |  |
| Array Compaction |  |  |  |  |  | C5A1 |
| Randomized Selection |  |  |  |  |  | C5A2 |

## What They Learned

- Novices are able to achieve speedup on a parallel machine.
- MPI and OpenMP both require more {code, cost per line, effort} than serial implementations
  - MPI takes more effort than OpenMP

| Data set | Programming Model | Speedup on 8 processors |
|---|---|---|
| **Speedup w.r.t. serial version** |  |  |
| C1A1 | MPI | mean 4.74, sd 1.97, n=2 |
| C3A3 | MPI | mean 2.8, sd 1.9, n=3 |
| C3A3 | OpenMP | mean 6.7, sd 9.1, n=2 |
| **Speedup w.r.t. parallel version run on 1 processor** |  |  |
| C0A1 | MPI | mean 5.0, sd 2.1, n=13 |
| C1A1 | MPI | mean 4.8, sd 2.0, n=3 |
| C3A3 | MPI | mean 5.6, sd 2.5, n=5 |
| C3A3 | OpenMP | mean 5.7, sd 3.0, n=4 |
## Ethernet Performance

- Achieves close to theoretical bandwidth even with the default 1500 B/message
- Broadcom part: 31 µs latency



Figure: Ethernet performance (x-axis: message size in bytes)

Performance Characteristics of Dual-Processor HPC Cluster Nodes Based on 64-bit Commodity Processors, Purkayastha et al.

## Myrinet

- Interconnect designed with clustering in mind
- 2 Gb/s links, possibly 2 physical links (so 4 Gb/s)
- \$850/node up to 128 nodes, \$1737/node up to 1024 nodes
- MPI latency 6–7 μs
- TCP/IP latency 27–30 μs
  - Strong support!



**Figure 2: Myrinet Performance** 

## Scalable Coherent Interface



- MPI: 4 µs latency, 1830 Mbps
- TCP/IP: 900 Mbps



**Figure 3: SCI Performance** (x-axis: message size in bytes)

## Quadrics

- "a premium interconnect choice on high-end systems such as the Compaq AlphaServer SC"
- Max 4096 ports
- \$2400/port for a small system, up to \$4078/port for a 1024-node system
- MPI: 2–3 µs latency, 6370 Mbps bw

**Figure 4: Quadrics Performance** (x-axis: message size in bytes)



## Infiniband

- Defined by industry consortium
  - Scalable—base is 2.5 Gb/s link, scales to 30 Gb/s
  - Current parts are 4x (10 Gb/s)
- \$1200-1600/node
- Also used as system interconnect
- 6750 Mb/s bandwidth, 6–7 μs latency



**Figure 5: Infiniband Performance** (x-axis: message size in bytes)

## 10 Gb Ethernet

- Similar tradeoffs to previous Ethernets
- \$10k per switch port
- Only 3700 Mb/s



Figure: 10 Gb Ethernet performance (x-axis: message size in bytes)

## **Application Results**

Table: results by number of nodes (1, 2, 4), shown once for 2 processors per node and once for 1 processor per node

### **Table 1: GAMESS results**