Next: Results
Up: CSE 557 Spring 2000
Previous: CSE 557 Spring 2000
This work measures the performance characteristics of two machines:
COCOA and the CSE Pond 101 network of workstations. COCOA (COst
effective COmputing Array) is a Penn State Department of Aerospace
Engineering initiative to bring low-cost parallel computing to the
departmental level. COCOA is a 50-processor cluster of off-the-shelf PCs
(dual Pentium II 400 MHz with 512 MB RAM) connected via Fast Ethernet
(100 Mbit/sec) and running Linux. A single
Bay Networks 24-port Fast Ethernet switch with a backplane bandwidth of
2.5 Gbps is used for the networking. Pond 101 consists
of 40-odd Sun SPARC workstations (of varying speeds) connected via a
switched 100 Mbit/sec network (Fast Ethernet) with two 24-port hubs linked
together through a single Ethernet connection.
The performance metrics that we are looking for are explained
below:
- 1.
- Mflops: "Mflops" is a standard measure of performance
for any processor used primarily for numerical computations. It is
calculated by counting all the floating point operations
(additions, subtractions, multiplications and divisions) that are
carried out during the execution of a program and dividing by the
execution time. These operations are counted only when performed on
float or double type variables, and not on integers (since integer
arithmetic is relatively simpler and is handled in a different way by
the processor). One Mflops
stands for one million (10^6) floating point operations per second.
The Mflops rating is indicative of a processor's performance for numerical
computations, and is usually directly proportional to its clock
speed (for the same class of processors).
- 2.
- Cache:
- L1 cache:
Level 1 (L1) cache consists of high-speed memory built into
the processor. By using this cache, the processor can access
frequently requested data more quickly. The amount of Level 1 cache
varies from processor to processor, and is not upgradeable. L1 cache
usually ranges from 8 KB to 128 KB.
- L2 cache:
Level 2 (L2) cache is separate from the processor and is upgradeable.
It is an order of magnitude slower than the L1 cache but several
times faster than the main memory (RAM). The Level 2 cache works in
conjunction with the microprocessor's internal cache (L1) to provide
maximum performance. The total amount of supported Level 2 cache also
varies from computer to computer. L2 cache usually ranges from 128 KB to 4 MB.
Other hardware being the same, a larger cache
usually (but not necessarily) leads to faster performance. When the
problem fits entirely within the cache, the program
can run several times faster than on a machine without cache.
Thus, knowing the cache size is quite helpful in predicting the
performance of a specific numerical code.
- 3.
- Message start-up time (t_s) or Latency:
Latency, a synonym for delay, expresses how much time it takes
for a packet of data to get from one designated point on the
network to another. Another way of looking at it is as
the time taken to establish a connection between two points on a
network, before any communication takes place. When many
small messages are exchanged on a network at different
times (which is fairly common in many applications),
latency is the biggest performance bottleneck.
- 4.
- Incremental message cost (t_w):
This is the time taken to transmit/receive each additional byte of
information between two nodes in a network, once communication
is established between them. It is just another way of expressing
communication speed, as its reciprocal gives the achievable
communication bandwidth (i.e., the amount of data that can be
transmitted in a fixed amount of time), which is a more commonly used
performance metric.
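Metrics 3 and 4 are commonly combined into a linear cost model for a single message. The formula below is that standard textbook model rather than something stated explicitly above; the message size m (in bytes) and the name t_comm are introduced here only for illustration, with t_s the start-up time and t_w the per-byte cost.

```latex
t_{\mathrm{comm}}(m) \approx t_s + t_w\, m,
\qquad
\text{bandwidth} \approx \frac{1}{t_w}
```

As a rough bound, assuming an ideal 100 Mbit/sec link with no protocol overhead, 100 Mbit/sec is 12.5 x 10^6 bytes/sec, so t_w could be no smaller than about 1 / (12.5 x 10^6) = 0.08 microseconds per byte. Measured values would be expected to be larger.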
Two small programs are written in ANSI C using the MPI
(Message Passing Interface) libraries for message passing, thus making
the implementation portable across all UNIX platforms. The first
program uses almost no MPI calls apart from the timing routine
(MPI_Wtime()), and is used to determine the peak Mflops rating and
the cache sizes of the processor. To measure the peak Mflops of the
processor, this program does a simple set of floating point calculations
of the form
x_i = a y_i + b y_{i+1} + c y_{i+2} + d y_{i+3}
using loop unrolling to minimize the cost due to loop overhead. To measure the
cache sizes, the arrays x_i and y_i are initially allocated a large
number of elements (10^6 in our case), and only a contiguous subset of
their elements is accessed in increasing order to determine the Mflops.
The size of this subset is then varied, and the discontinuities in the
array size vs Mflops graph, if any, indicate the cache sizes of the
processor.
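The kernel described above can be sketched as follows. This is a minimal illustration and not the program attached to the report: the function names stencil_pass and mflops, the unroll factor of two, and the count of 7 floating point operations per element (4 multiplications and 3 additions) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Floating point operations per output element:
 * 4 multiplications + 3 additions (assumed counting convention). */
#define FLOPS_PER_ELEMENT 7

/* Hypothetical helper: computes
 *   x[i] = a*y[i] + b*y[i+1] + c*y[i+2] + d*y[i+3]
 * for i = 0 .. n-1, unrolled by two. The array y must hold at least
 * n + 3 elements. Returns the number of floating point operations. */
double stencil_pass(double *x, const double *y, size_t n,
                    double a, double b, double c, double d)
{
    size_t i = 0;

    /* Unrolled body: two elements per iteration to reduce loop overhead. */
    for (; i + 1 < n; i += 2) {
        x[i]     = a * y[i]     + b * y[i + 1] + c * y[i + 2] + d * y[i + 3];
        x[i + 1] = a * y[i + 1] + b * y[i + 2] + c * y[i + 3] + d * y[i + 4];
    }
    /* Remainder element when n is odd. */
    for (; i < n; i++) {
        x[i] = a * y[i] + b * y[i + 1] + c * y[i + 2] + d * y[i + 3];
    }
    return (double)FLOPS_PER_ELEMENT * (double)n;
}

/* Mflops = floating point operations / (elapsed seconds * 10^6). */
double mflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / (seconds * 1.0e6);
}
```

In a measurement run, one would record MPI_Wtime() before and after many repeated calls to stencil_pass on the first n elements, pass the accumulated operation count and the elapsed time to mflops, and repeat for a range of n to obtain the array size vs Mflops graph.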
The second program uses MPI_Send()/MPI_Recv() calls
in MPI to communicate between processors. The size of the message
being communicated is varied in a loop to determine its effect
on the communication time, and the same process is repeated for
several pairs of processors communicating in parallel. The resulting
graph reveals the latency and bandwidth of the network, and the effect
of increasing the number of communicating pairs.
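One way to read latency and bandwidth off such measurements is to fit a straight line to the (message size, time) points: the intercept estimates the start-up time and the slope estimates the per-byte cost. The sketch below does this with an ordinary least-squares fit. It assumes the linear model time = t_s + t_w * size and is not necessarily how the graphs in this report were analysed; the function name fit_latency_bandwidth is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ordinary least-squares fit of
 *   seconds = ts + tw * bytes
 * to npts measurements. On return, *ts holds the intercept (latency,
 * in seconds) and *tw the slope (seconds per byte). */
void fit_latency_bandwidth(const double *bytes, const double *seconds,
                           size_t npts, double *ts, double *tw)
{
    double sm = 0.0, st = 0.0, smm = 0.0, smt = 0.0;
    double denom;
    size_t k;

    assert(npts >= 2);
    for (k = 0; k < npts; k++) {
        sm  += bytes[k];
        st  += seconds[k];
        smm += bytes[k] * bytes[k];
        smt += bytes[k] * seconds[k];
    }

    /* The denominator is zero only if all message sizes are identical. */
    denom = (double)npts * smm - sm * sm;
    assert(denom != 0.0);

    *tw = ((double)npts * smt - sm * st) / denom;
    *ts = (st - *tw * sm) / (double)npts;
}
```

The achievable bandwidth in bytes per second is then approximately 1.0 / tw. Because large messages dominate a fit over all sizes, the latency is often estimated separately from the smallest messages only.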
Both programs are attached at the end of the report.
Anirudh Modi
2000-02-21