Next: Conclusions Up: CSE 557 Spring 2000 Previous: Approach

Results

The output from the first program is used to generate the plots in Figures 1 and 2. The programs were compiled using the most aggressive optimization flags available for each architecture (i.e., mpicc -fast on the Pond Lab machines, and mpicc -O6 -mpentiumpro -funroll-loops on COCOA). They were run at the highest priority (using the nice -19 command in UNIX); otherwise, the results become very noisy because the multitasking operating system constantly swaps the processes in and out.

Looking at Figure 1, it is clear that the L1 cache on one of the machines in Pond 101, melmak.cse.psu.edu, is 32 KB, since there is a sharp discontinuity around that point. The other discontinuity, although not as sharp, leads us to believe that the L2 cache is most likely 1 MB (i.e., 1024 KB), as the performance starts to deteriorate much more sharply beyond that point [Note: cache sizes are usually a power of 2, hence we eliminate predictions such as 900 KB or 1100 KB]. The plot also shows the peak processing speed of melmak to be around 175 Mflops, which is obtained when the problem fits entirely in the L1 cache. Once the problem becomes too large to fit in either the L1 or the L2 cache, the processing speed drops drastically and levels off at a constant 84 Mflops.

Similarly, looking at Figure 2, we find that the L1 and L2 cache sizes for the COCOA server (Intel Pentium II Xeon 450 MHz processor) are 16 KB and 1 MB respectively, while those for the COCOA client nodes (Intel Pentium II 400 MHz processors) are 16 KB and 512 KB respectively. A possible explanation for the performance drop-off around 16 KB, which is well below the Intel-specified size of 32 KB for both CPUs, is that the stated L1 cache actually consists of two parts: 16 KB of data cache and 16 KB of instruction cache. For our application, it is the data cache that is relevant (and is measured), as the instruction cache holds the CPU instructions (as the name states). The peak processing speed in this case is seen to be 255 Mflops for the Pentium II Xeon 450 MHz processor and 225 Mflops for the Pentium II 400 MHz processor (note: $255/450 \approx 225/400$ as $0.567 \approx 0.563$ , which shows that the peak speeds are directly proportional to the clock speeds). Once the problem becomes too large to fit in either cache, the processing speed drops to a constant of about 70 Mflops in both cases.
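The proportionality noted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the peak rates and clock speeds quoted in the text; the function name `flops_per_cycle` and the dictionary layout are illustrative choices, not part of the original programs.

```python
# Peak Mflops and clock speed (MHz) as quoted in the text for the COCOA CPUs.
machines = {
    "Pentium II Xeon 450 MHz (COCOA server)": (255.0, 450.0),
    "Pentium II 400 MHz (COCOA client)": (225.0, 400.0),
}

def flops_per_cycle(peak_mflops, clock_mhz):
    """Floating-point operations completed per clock cycle at peak."""
    return peak_mflops / clock_mhz

ratios = {name: flops_per_cycle(*spec) for name, spec in machines.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f} flops/cycle")

# Relative spread between the two ratios; a small value supports the
# claim that peak speed scales with clock speed.
lo, hi = min(ratios.values()), max(ratios.values())
print(f"relative difference: {(hi - lo) / lo:.2%}")
```

Both ratios come out close to 0.56 flops per cycle and differ by less than one percent, which is consistent with the two processors sharing the same core design and differing mainly in clock rate.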

Relating these performance figures to the relaxation problem that was discussed in Assignment #1, we can say that the peak Mflops rate is achievable only if the total data size of the problem fits entirely in the L1 data cache of the processor (which is 32 KB for some of the Pond Lab machines, and 16 KB for COCOA). Assuming that we are running the second version of the relaxation algorithm, which uses only a single k x k array of type double (8 bytes), we can fit a problem with a grid as large as 45 x 45 in 16 KB of L1 data cache, and 64 x 64 in 32 KB of L1 data cache.
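The grid sizes quoted above follow from requiring that k x k x 8 bytes not exceed the cache size. A minimal sketch of that calculation is given below; it assumes the array is the only data competing for the L1 data cache, and the helper name `max_grid_side` is hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8

def max_grid_side(cache_bytes, bytes_per_elem=BYTES_PER_DOUBLE):
    """Largest k such that a k x k array of the given element size fits."""
    # k*k*bytes_per_elem <= cache_bytes  <=>  k <= isqrt(cache_bytes // bytes_per_elem)
    return math.isqrt(cache_bytes // bytes_per_elem)

for cache_kb in (16, 32):
    k = max_grid_side(cache_kb * 1024)
    used = k * k * BYTES_PER_DOUBLE
    print(f"{cache_kb} KB L1 data cache -> {k} x {k} grid ({used} bytes)")
# 16 KB L1 data cache -> 45 x 45 grid (16200 bytes)
# 32 KB L1 data cache -> 64 x 64 grid (32768 bytes)
```

In practice the usable grid is likely to be somewhat smaller, since loop variables and other data share the same cache.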

**Figure 1:** Cache size and Mflops on *melmak.cse.psu.edu*
**Figure 2:** Cache size and Mflops on *COCOA*

The output from the second program (run on COCOA) is used to generate the plots in Figures 3 and 4. Figure 3 plots the communication time vs. message size for all the processor pairs as a scatter plot and fits two straight lines to the data, one for each of the two different slopes clearly seen in the plot. The discontinuity appears at a message size of around 1500 bytes, which is explained by the fact that the Maximum Transmission Unit (MTU) set on the Ethernet cards of all the COCOA nodes is also 1500 bytes. Messages smaller than 1500 bytes therefore often leave part of a packet unfilled, which decreases the effective bandwidth. From the figure, the start-up time (t_s: y-intercept of the line) for messages up to 1500 bytes is 181.6 $\mu sec$ , and that for messages larger than 1500 bytes is 275.2 $\mu sec$ . The incremental message cost (t_w: slope of the line) for messages smaller than 1500 bytes is 0.1509 $\mu sec$ /byte, corresponding to a bandwidth of 53.02 Mbits/sec. For messages greater than or equal to 1500 bytes, the incremental message cost goes down to 0.0918 $\mu sec$ /byte, corresponding to a bandwidth of 87.15 Mbits/sec. The run on the Pond Lab machines could not be completed due to the unavailability of sufficient resources and the lack of time, but a preliminary analysis showed that the start-up time of its network was approximately 240 $\mu sec$ for small message sizes.
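The two fitted lines can be read as a piecewise linear cost model, t = t_s + t_w m for a message of m bytes, and the quoted bandwidths are simply 8 / t_w. The sketch below reproduces those numbers from the fitted values in the text. The function names are hypothetical, and placing the break exactly at 1500 bytes (with 1500 itself on the large-message line) is an assumption made for illustration.

```python
# Fitted values quoted in the text for COCOA (usec and usec/byte).
SMALL = {"t_s": 181.6, "t_w": 0.1509}   # messages below 1500 bytes
LARGE = {"t_s": 275.2, "t_w": 0.0918}   # messages of 1500 bytes or more
MTU_BYTES = 1500                         # assumed break point between the fits

def bandwidth_mbits(t_w):
    """Asymptotic bandwidth in Mbits/sec from a per-byte cost in usec/byte."""
    return 8.0 / t_w   # 8 bits per byte; bits per usec equals Mbits per sec

def comm_time_usec(message_bytes):
    """Predicted one-way communication time from the piecewise linear fit."""
    fit = SMALL if message_bytes < MTU_BYTES else LARGE
    return fit["t_s"] + fit["t_w"] * message_bytes

print(f"small messages: {bandwidth_mbits(SMALL['t_w']):.2f} Mbits/sec")
print(f"large messages: {bandwidth_mbits(LARGE['t_w']):.2f} Mbits/sec")
for m in (100, 1000, 1500, 8000):
    print(f"{m:5d} bytes -> {comm_time_usec(m):7.1f} usec")
```

The two bandwidth lines print 53.02 and 87.15 Mbits/sec, matching the figures above. For small messages the start-up term dominates: at 100 bytes the model predicts under 200 usec, of which 181.6 usec is t_s.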

From Figure 4, we can see that although the net communication bandwidth (i.e., the sum of the communication bandwidths of all communicating pairs of processors) increases with the number of processors, the increase is not exactly linear. When the number of simultaneously communicating processor pairs becomes large, the communication time between every pair increases, as the backplane bandwidth of the switch gets saturated and the messages can no longer be communicated at the same speed. It can be clearly seen in the figure that the 12 processor case is slower by about 50 $\mu sec$ compared to the 2 or 6 processor case. Once the number of communicating pairs becomes sufficiently large, the net communication bandwidth will be dictated solely by the backplane bandwidth of the switch (which is 2.4 Gbps in the case of COCOA), and can under no circumstances exceed it.
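As a rough illustration of the saturation argument, the sketch below models the net bandwidth as the per-pair bandwidth times the number of pairs, capped at the backplane limit. This is an idealized assumption, not a measured result: it takes the 87.15 Mbits/sec large-message figure as the per-pair rate, ignores how the switch counts traffic in each direction, and the function names are hypothetical.

```python
import math

BACKPLANE_MBITS = 2400.0   # 2.4 Gbps switch backplane quoted in the text
PAIR_MBITS = 87.15         # per-pair large-message bandwidth quoted in the text

def net_bandwidth_mbits(pairs, pair_mbits=PAIR_MBITS, cap=BACKPLANE_MBITS):
    """Idealized net bandwidth: linear in the number of pairs, capped by the switch."""
    return min(pairs * pair_mbits, cap)

def pairs_to_saturate(pair_mbits=PAIR_MBITS, cap=BACKPLANE_MBITS):
    """Smallest number of pairs whose combined demand reaches the cap."""
    return math.ceil(cap / pair_mbits)

for pairs in (1, 3, 6, 30):
    print(f"{pairs:2d} pairs -> {net_bandwidth_mbits(pairs):7.1f} Mbits/sec")
print("pairs needed to saturate (idealized):", pairs_to_saturate())
```

Under these assumptions the cap is reached at 28 pairs. The measured curve departs from linearity earlier than this simple model predicts, since the text already reports a slowdown in the 12 processor case.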

**Figure 3:** Communication time vs. message size on *COCOA*
**Figure 4:** Communication time for different numbers of processors on *COCOA*


Anirudh Modi
2000-02-21