Next: Results Up: CSE 557 Spring 2000 Previous: CSE 557 Spring 2000

Approach

This work seeks the performance characteristics of two machines: COCOA and the CSE Pond 101 network of workstations. COCOA (COst effective COmputing Array) is a Penn State Department of Aerospace Engineering initiative to bring low-cost parallel computing to the departmental level. COCOA is a 50-processor cluster of off-the-shelf PCs (dual Pentium II 400 MHz with 512 MB RAM) connected via Fast Ethernet (100 Mbit/sec) and running Linux. A single Bay Networks 24-port Fast Ethernet switch with a backplane bandwidth of 2.5 Gbps is used for the networking. Pond 101 consists of 40-odd Sun SPARC workstations (of varying speeds) connected via a switched 100 Mbit/sec network (Fast Ethernet) with two 24-port hubs linked together through a single Ethernet connection.

The performance metrics that we are looking for are explained below:

1.

Mflops: ``Mflops'' is a standard measure of performance for any processor used primarily for numerical computations. It is calculated by counting all the floating point operations (additions, subtractions, multiplications and divisions) that are carried out during the execution of a program and dividing by the execution time. Operations are counted only when performed on float or double type variables, not on integers (since integer arithmetic is simpler and is handled in a different way by the processor). One Mflops stands for one million (10⁶) floating point operations per second. The Mflops rating is indicative of a processor's performance for numerical computations, and is usually directly proportional to its clock speed (for the same class of processors).

2.

Cache:

L1 cache: Level 1 (L1) cache consists of high-speed memory built into the processor. By using this cache, the processor can access frequently requested data more quickly. The amount of L1 cache varies from processor to processor, and is not upgradeable. L1 cache usually ranges from 8 KB to 128 KB.
L2 cache: Level 2 (L2) cache is separate from the processor and is upgradeable. It is an order of magnitude slower than the L1 cache but several times faster than the main memory (RAM). The L2 cache works in conjunction with the microprocessor's internal cache (L1) to provide maximum performance. The total amount of supported L2 cache also varies from computer to computer. L2 cache usually ranges from 128 KB to 4 MB.

Other hardware being the same, a larger cache usually (but not necessarily) leads to faster performance. When the problem fits entirely within the cache, the program can run several times faster than on a machine without cache. Thus, knowledge of the cache size is quite beneficial in predicting the performance of a specific numerical code.

3.

Message start-up time (t_s) or latency: Latency, a synonym for delay, is the time it takes for a packet of data to get from one designated point on the network to another. Another way of looking at it is as the time taken to establish a connection between two points on a network, before any communication takes place. When several small messages are transmitted/exchanged on a network at different times (a fairly common occurrence in many applications), latency is the biggest performance bottleneck.

4.

Incremental message cost (t_w): This is the time taken to transmit/receive each additional byte of information between two nodes in a network, once communication is established between them. It is just another way of measuring communication speed: its reciprocal gives the achievable communication bandwidth (i.e., the amount of data that can be transmitted in a fixed amount of time), which is the more commonly used performance metric.
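One common way to combine items 3 and 4 is a linear cost model for a message of m bytes. The worked numbers below are illustrative assumptions only (a t_w equal to the reciprocal of the nominal 100 Mbit/sec line rate and a made-up t_s); they are not measurements from either machine.

```latex
\[
  T(m) = t_s + t_w\, m, \qquad \text{bandwidth} = \frac{1}{t_w}
\]
% Illustrative values: t_s = 100 \mu s, t_w = 0.08 \mu s/byte
% (0.08 \mu s/byte is the reciprocal of 12.5 MB/s, i.e. 100 Mbit/s).
\[
  T(1000) = 100\,\mu\mathrm{s} + 0.08\,\mu\mathrm{s/byte} \times 1000\ \mathrm{bytes}
          = 180\,\mu\mathrm{s}
\]
```

With these assumed values, more than half of the time for a 1000-byte message is start-up cost, which is why latency dominates when messages are small.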

Two small programs are written in ANSI C using the MPI (Message Passing Interface) libraries for message passing, thus making the implementation portable across all UNIX platforms. The first program barely uses MPI calls except for the timing routine (MPI_Wtime()), and is used to determine the peak Mflops rating and the cache sizes of the processor. To measure the peak Mflops of the processor, this program does a simple set of floating point calculations of the form x_i = a y_i + b y_{i+1} + c y_{i+2} + d y_{i+3}, using loop unrolling to minimize the cost of loop overhead. To measure the cache sizes, the arrays x and y are initially allocated a large number of elements (10⁶ in our case), and only a contiguous subset of their elements is accessed in increasing order to determine the Mflops. The size of this subset is then varied, and the discontinuities in the array size vs Mflops graph, if any, indicate the cache sizes of the processor.
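The kernel described above can be sketched as follows. This is a minimal illustration, not the program attached to the report: the function names (stencil, mflops, measure_mflops), the unroll factor of four, the use of double, and the coefficient values are all assumptions, and the standard clock() stands in for MPI_Wtime() so that the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Flops per output element: 4 multiplications + 3 additions. */
#define FLOPS_PER_ELEMENT 7

/* Compute x[i] = a*y[i] + b*y[i+1] + c*y[i+2] + d*y[i+3] for i in [0, n).
 * y must hold at least n + 3 elements. The main loop is unrolled by four
 * (an assumed factor) to reduce loop overhead. */
static void stencil(double *x, const double *y, size_t n,
                    double a, double b, double c, double d)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     = a * y[i]     + b * y[i + 1] + c * y[i + 2] + d * y[i + 3];
        x[i + 1] = a * y[i + 1] + b * y[i + 2] + c * y[i + 3] + d * y[i + 4];
        x[i + 2] = a * y[i + 2] + b * y[i + 3] + c * y[i + 4] + d * y[i + 5];
        x[i + 3] = a * y[i + 3] + b * y[i + 4] + c * y[i + 5] + d * y[i + 6];
    }
    for (; i < n; ++i)  /* remainder when n is not a multiple of four */
        x[i] = a * y[i] + b * y[i + 1] + c * y[i + 2] + d * y[i + 3];
}

/* Convert a floating point operation count and elapsed seconds to Mflops. */
static double mflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds / 1.0e6;
}

/* Sweep the first n elements `repeats` times and return the Mflops rate.
 * Returns 0.0 if the run was too short for clock() to resolve. */
static double measure_mflops(double *x, const double *y, size_t n, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        stencil(x, y, n, 1.0, 2.0, 3.0, 4.0);  /* hypothetical coefficients */
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return mflops((double)n * (double)repeats * FLOPS_PER_ELEMENT, seconds);
}
```

Calling measure_mflops() for a range of n, with the repeat count chosen so that each run lasts long enough to time reliably, would give the points of an array size vs Mflops curve. Assuming double-precision arrays, a subset of n elements touches roughly 16n bytes across x and y, so a drop in Mflops near a given n suggests a cache of about that size. An optimizing compiler may remove repeated identical sweeps, so a real benchmark would need to guard against that.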

The second program uses MPI_Send()/MPI_Recv() calls in MPI to communicate between processors. The size of the message being communicated is varied in a loop to determine its effect on the communication time, and the same process is repeated for several pairs of processors communicating in parallel. The resulting graph of communication time vs message size then reveals the latency and bandwidth of the network, and the effect of increasing the number of communicating pairs.
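One way to read t_s and t_w off such timings is a least-squares line through the (message size, time) points, taking the intercept as t_s and the slope as t_w. The report does not say how the values were extracted from the graph, so the fitting method here is an assumption, as are the names link_model, fit_link_model and bandwidth_bytes_per_sec.

```c
#include <assert.h>
#include <stddef.h>

/* Parameters of the assumed linear model t(m) = t_s + t_w * m. */
struct link_model {
    double t_s;  /* start-up time (latency), in seconds */
    double t_w;  /* incremental cost, in seconds per byte */
};

/* Ordinary least-squares fit of one-way message times against message sizes.
 * bytes[i] is the message size and seconds[i] the measured time for it.
 * Requires at least two distinct message sizes. */
static struct link_model fit_link_model(const double *bytes,
                                        const double *seconds, size_t n)
{
    assert(n >= 2);
    double sum_m = 0.0, sum_t = 0.0, sum_mm = 0.0, sum_mt = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum_m  += bytes[i];
        sum_t  += seconds[i];
        sum_mm += bytes[i] * bytes[i];
        sum_mt += bytes[i] * seconds[i];
    }
    double denom = (double)n * sum_mm - sum_m * sum_m;
    assert(denom != 0.0);  /* all sizes identical: slope is undefined */

    struct link_model model;
    model.t_w = ((double)n * sum_mt - sum_m * sum_t) / denom;  /* slope */
    model.t_s = (sum_t - model.t_w * sum_m) / (double)n;       /* intercept */
    return model;
}

/* Achievable bandwidth is the reciprocal of the per-byte cost. */
static double bandwidth_bytes_per_sec(struct link_model model)
{
    assert(model.t_w > 0.0);
    return 1.0 / model.t_w;
}
```

In an actual run, the two input arrays would be filled from the MPI_Send()/MPI_Recv() timing loop; if round-trip times are measured, they would need to be halved before fitting. Repeating the fit for each number of communicating pairs would show how the effective t_s and t_w change under load.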

Both programs are attached at the end of the report for perusal.


Anirudh Modi
2000-02-21