Looking at Figure 1, it is clear that the L1 cache on melmak.cse.psu.edu, one of the machines in Pond 101, is 32 KB, since there is a sharp discontinuity around that point. The second discontinuity, although less sharp, suggests that the L2 cache is most likely 1 MB (i.e., 1024 KB), as the performance deteriorates much more steeply beyond that point. [Note: cache sizes are usually a power of 2, hence we eliminate predictions such as 900 KB or 1100 KB.] The plot also shows the peak processing speed of melmak to be around 175 Mflops, which is obtained when the problem fits entirely in the L1 cache. Once the problem becomes too large to fit in either the L1 or the L2 cache, the processing speed drops drastically and levels off at a constant 84 Mflops.
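
The power-of-2 argument in the note above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample estimates are hypothetical, and rounding on a log scale is one reasonable choice rather than the method used to read the plots.

```python
import math

def nearest_power_of_two_kb(estimate_kb: float) -> int:
    """Snap a cache-size estimate (in KB) to the nearest power of two,
    measured on a log2 scale. Hypothetical helper for illustration."""
    if estimate_kb <= 0:
        raise ValueError("estimate must be positive")
    return 2 ** round(math.log2(estimate_kb))

# Rough readings such as 900 KB or 1100 KB both snap to 1024 KB (1 MB).
for guess_kb in (900, 1100, 30):
    print(guess_kb, "KB ->", nearest_power_of_two_kb(guess_kb), "KB")
```
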
Similarly, looking at Figure 2, we find that the L1 and
L2 cache sizes for the COCOA server (Intel Pentium II Xeon 450 MHz
processor) are 16 KB and 1 MB respectively, while those for the COCOA
client nodes (Intel Pentium II 400 MHz processors) are 16 KB and 512
KB respectively. A possible explanation for the performance drop-off
around 16 KB, which is well below the Intel-specified size of 32 KB
for both CPUs, is that the stated L1 cache actually consists of
two parts: 16 KB of data cache and 16 KB of instruction cache. For our
application, it is the data cache that is relevant (and is measured),
as the instruction cache holds only the CPU instructions
(as the name states). The peak processing speed in this case
is seen to be 255 Mflops for the Pentium II Xeon 450 MHz processor and
225 Mflops for the Pentium II 400 MHz processor (note:
255/225 ≈ 1.13, while 450/400 = 1.125,
which shows that the speeds are
almost directly proportional to the clock speeds). Once the problem size
becomes too large to fit in either cache, the processing
speed drops to a constant of about 70 Mflops in both cases.
Relating these performance figures to the relaxation problem discussed in Assignment #1, we can say that the peak Mflops rate is achievable only if the total data size of the problem fits entirely in the L1 data cache of the processor (which is 32 KB for some of the Pond Lab machines, and 16 KB for COCOA). Assuming that we are running the second version of the relaxation algorithm, which uses only a single k x k array of type double (8 bytes), a 16 KB L1 data cache holds 2048 doubles and a 32 KB cache holds 4096, so we can fit a grid as large as 45 x 45 in 16 KB of L1 data cache, and 64 x 64 in 32 KB.
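
The grid sizes quoted above follow from simple arithmetic, sketched below in Python. The sketch assumes the whole L1 data cache is available to the single k x k array of doubles (nothing else competing for cache lines); the function name is hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8

def max_square_grid(cache_kb: int, bytes_per_element: int = BYTES_PER_DOUBLE) -> int:
    """Largest k such that a k x k array of the given element size
    fits in cache_kb kilobytes. Assumes the array has the cache to itself."""
    elements = (cache_kb * 1024) // bytes_per_element
    return math.isqrt(elements)

for cache_kb in (16, 32):
    k = max_square_grid(cache_kb)
    print(f"{cache_kb} KB L1 data cache: {k} x {k} grid "
          f"({k * k * BYTES_PER_DOUBLE} bytes)")
```
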
The output from the second program (run on COCOA) is used to generate the plots in Figures 3 and 4. Figure 3 plots the communication time vs message size for all the processor pairs as a scatter plot and fits two straight lines to the data, one for each of the two distinct slopes clearly seen in the plot. The discontinuity is seen at a message size of around 1500 bytes, which is easily explained by the fact that the Maximum Transmission Unit (MTU) set on the Ethernet cards of each of the nodes on COCOA is also 1500 bytes. Messages smaller than 1500 bytes therefore often leave a packet only partly filled, which decreases the bandwidth. From the figure, the start-up time (ts: y-intercept of the line) for messages up to 1500 bytes is noted as 181.6 μs,
and that for
messages larger than 1500 bytes is seen to be 275.2 μs.
The
incremental message cost (tw: slope of the line) for messages smaller
than 1500 bytes is seen to be 0.1509 μs/byte, corresponding to a
bandwidth of 53.02 Mbits/sec. For messages greater than or equal to 1500
bytes, the incremental message cost goes down to 0.0918 μs/byte,
corresponding to a bandwidth of 87.15 Mbits/sec. The run on the
Pond Lab machines could not be completed due to the unavailability
of sufficient resources and the lack of time, but a preliminary
analysis showed that the start-up time of its network was approximately
240 μs
for small message sizes.
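
The two fitted lines amount to a piecewise-linear cost model, t = ts + tw * m, which can be sketched as follows. The constants are the values read off Figure 3; the time unit is taken to be microseconds, since that is the unit under which the quoted slopes reproduce the quoted bandwidths (8 bits / 0.1509 μs ≈ 53.02 Mbits/sec). The function names are hypothetical, and the break is placed at exactly 1500 bytes as an assumption.

```python
MTU_BYTES = 1500

# (ts in microseconds, tw in microseconds per byte), as read off Figure 3
SMALL = (181.6, 0.1509)   # messages below 1500 bytes
LARGE = (275.2, 0.0918)   # messages of 1500 bytes or more

def comm_time_us(message_bytes: int) -> float:
    """Estimated point-to-point time in microseconds: ts + tw * m."""
    ts, tw = SMALL if message_bytes < MTU_BYTES else LARGE
    return ts + tw * message_bytes

def bandwidth_mbits_per_s(tw_us_per_byte: float) -> float:
    """8 bits per byte divided by microseconds per byte gives Mbits/sec."""
    return 8.0 / tw_us_per_byte

print(round(bandwidth_mbits_per_s(SMALL[1]), 2))  # small-message bandwidth
print(round(bandwidth_mbits_per_s(LARGE[1]), 2))  # large-message bandwidth
print(comm_time_us(1000), comm_time_us(4096))
```

Because the start-up time dominates for short messages, the model suggests batching many small messages into fewer large ones whenever the algorithm allows it.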
From Figure 4, we can see that although the net
communication bandwidth (i.e., the sum of the communication bandwidths of
each communicating pair of processors) increases with the number of
processors, the increase is not exactly linear. When the number of simultaneously
communicating processor pairs becomes large, the communication time
between every pair increases, as the backplane bandwidth of the switch
gets saturated and the messages can no longer be communicated at the
same speed. It can be clearly seen in the figure that the 12 processor
case is slower by about 50%
as compared to the 2 or 6 processor
case. Once the number of communicating pairs becomes sufficiently large,
the net communication bandwidth will be dictated solely by the backplane
bandwidth of the switch (which is 2.4 Gbps in the case of COCOA), and
can under no circumstances exceed that.
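
The backplane limit can be expressed as a simple ceiling on the net bandwidth, sketched below. This is an idealized upper bound, not a fit to Figure 4: it assumes every pair would otherwise sustain the large-message rate of 87.15 Mbits/sec from Figure 3 and that the 2.4 Gbps backplane is the only shared bottleneck, so it does not capture the gradual slowdown observed at 12 processors. The function and constant names are hypothetical.

```python
BACKPLANE_MBITS = 2400.0   # 2.4 Gbps switch backplane on COCOA
PER_PAIR_MBITS = 87.15     # assumed per-pair rate (large messages, Figure 3)

def net_bandwidth_ceiling(pairs: int,
                          per_pair: float = PER_PAIR_MBITS,
                          backplane: float = BACKPLANE_MBITS) -> float:
    """Idealized upper bound on the net bandwidth (Mbits/sec) for a given
    number of simultaneously communicating processor pairs."""
    if pairs < 0:
        raise ValueError("pairs must be non-negative")
    return min(pairs * per_pair, backplane)

for pairs in (1, 3, 6, 28, 50):
    print(pairs, "pairs ->", net_bandwidth_ceiling(pairs), "Mbits/sec at most")
```
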