The vi editor is useful because it is universally available on unix machines and does not require a forwarded xterm (see Lab 1). Perhaps you are not fond of vi and would rather not have to think about its commands. There are other editors available. One easy-to-use editor is "nano".
>nano myfile
Using Windows-edited files in linux or unix often fails because the "new line" character differs between Windows and linux. A Windows file "foofile" can be converted to linux format by typing
dos2unix foofile
Many Unix users prefer the emacs editor to vi. Its graphical interface requires X forwarding back to your local display.
If you're logging in via putty from an NCSU lab machine, first start the xwin program. Then, at the prompt,
>emacs myfile
opens the editor. To run emacs inside the terminal without an X window, use
>emacs -nw myfile
These labs are not at all complete. The HowTo page has useful information on how to get started, how to manage storage, commonly encountered errors, etc. More specific information (on some commonly used software packages) is available from the Software page.
mpif90 -c monte2.f ranpak.f
mpif90 monte2.o ranpak.o -o monte2
If you want to see how a makefile for this simple problem would look, try a file makemonte2 consisting of (note that the command lines below must begin with a tab)
FC = mpif77
mont:
	$(FC) -c monte2.f ranpak.f
	$(FC) -o monte2 monte2.o ranpak.o
Then type
make -f makemonte2
Traditionally, unix users would port codes by modifying makefiles. More recently, one runs a "configure" script, which automatically adapts the makefile to match the local compilers, compiler options, and library locations (some of these can be given as inputs by the user).
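For a typical autoconf-style package the sequence looks something like the following (the install prefix here is only an example):
>./configure --prefix=$HOME/local
>make
>make install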
Here's a link to a more complete set of unix/linux tutorials: Professor Norm Matloff's Tutorial Center (CS, UC Davis).
Then use an MPI_ALLREDUCE call to deposit the answer on all the processors. Here's the argument list for an MPI_ALLREDUCE and the Fortran declarations.
MPI_ALLREDUCE(SENDBUF,RECVBUF,COUNT,DATATYPE,OP,COMM,IERROR)
Modify your run script from the first lab. Get it to run.
Uncomment the MPI_Barrier. Compare the times.
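For reference, here is a minimal Fortran sketch (not the lab's actual code) of the timing pattern, with the barrier left commented out so the two cases can be compared:

      program bartime
c Minimal sketch: time a section of work with MPI_Wtime; the
c MPI_Barrier can be uncommented to synchronize the ranks first.
      implicit none
      include 'mpif.h'
      integer ierr, rank
      double precision t0, t1
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c Uncomment the next line so all ranks start timing together
c     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t0 = MPI_WTIME()
c ... the work to be timed goes here ...
      t1 = MPI_WTIME()
      print *, 'rank', rank, ' elapsed time ', t1 - t0
      call MPI_FINALIZE(ierr)
      end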
Actually, the most commonly used engineering language may not be C or Fortran. It's probably Matlab. Why? Because we can first try out our code and see whether it does what we want it to do, and then, when we're happy with it but need it to run faster, we convert it to Fortran or C.
So wouldn't it be great if we could parallelize our Matlab code with MPI calls? Then we could easily try out communication patterns. And actually, it turns out we can: we've installed the bcmpi package from Ohio Supercomputing. For help in using bcmpi, see the bcMPI page.
The following is mainly based on Chapter 5 of Goedecker and Hoisie and is one of the CSC 783 lectures.
I sometimes think that the main way to speed up code, serial or parallel, is to keep data local: in fast cache, or a larger cache, or RAM, or, in the parallel case, to have "prefetched" it from some other processor's memory.
But keeping data local doesn't guarantee full-speed performance. It's really very easy to have your data local and still have slow code. As you've now seen,
even an in-cache matrix vector multiply may not run at the full processor
speed. Having profiled to see which part of the code takes a good bit of the time, it's worth concentrating on that part and asking whether, with the data in cache, we could make it go fast. Some would say this is "too hard", but usually I've found that if a class keeps playing with a piece of code, after a while its speed improves a lot. And the Goedecker and Hoisie tricks are not that hard to play--and can help a lot.
A caveat.
For scientific computations we tend to just want to count floating point
operations, but in general integer operations are also important. For
example in a sparse matrix vector multiply, we store only the nonzero
entries of a matrix along with integer index vectors. In the simplest scheme there are two integer vectors, one holding the row index and one holding the column index of each stored entry. If the 3 vectors are afloat, indx, indy, then the element afloat(k) corresponds to the element A(indx(k),indy(k)) of the original matrix. So the integer computations for indexing become important in such a matrix vector multiply.
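As a concrete sketch of that storage scheme (the routine and variable names below are mine, not from the text), a coordinate-format matrix vector multiply for a square n by n matrix might look like:

c Sketch: y = A*x with A stored in coordinate form, i.e. afloat(k)
c holds A(indx(k),indy(k)) for k = 1,...,nnz, as described above.
      subroutine coomv(nnz, n, afloat, indx, indy, x, y)
      implicit none
      integer nnz, n, k
      integer indx(nnz), indy(nnz)
      real*8 afloat(nnz), x(n), y(n)
      do k = 1, n
         y(k) = 0.d0
      end do
      do k = 1, nnz
c one multiply-add plus the integer indexing work per nonzero
         y(indx(k)) = y(indx(k)) + afloat(k)*x(indy(k))
      end do
      return
      end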
But at least let's start with making in-cache flops go fast.
>a = e*f + b + c*d
which looks like 3 clock cycles? But actually, there is also a dependency, so there's likely to be an extra delay between each operation. So perhaps the first version takes 3 clock cycles and the second one takes 5? Often, if we have a loop, we can eliminate the dependency.
Hopefully, the chip and compiler designers are aware of these same issues and are automating some of the following with thread-level parallelism and the ability to do "out-of-order" computation? The following example shows the compiler and chip designers haven't gotten there yet . . .
Consider a sum computation, as in the getsum program further on. Note that only adds are performed. Let's also try putting in a multiply, so that the computational line becomes sum = sum + a(i,j)*a(i,j).
There are a couple of possible solutions. One is to take a greater depth of loop-unrolling: perhaps instead of 4 we should unroll to a depth of 12? The authors give an example where loop-unrolling of depth 12 worked very well on an Alpha chip. Or possibly we should make the code more explicit by introducing some new variables and a prefetch, so there is no dependency on the previous step?
A solution that seems to work on Pentiums, in the middle of a matrix vector multiply, is to loop unroll in both the i and j directions. Then the machine can be loading several columns of A at the same time.
If we write a matrix vector multiply in the straightforward way shown further on (Fortran; for C, transpose the discussion), elements of b get reloaded many times. But if we block the code, then we can "usually" have all the elements of a, b, and x ready, and we don't have to reload the elements of b all n times; in the blocked version further on, b is reloaded only n/4 times.
On 32 register machines, it made some sense to worry about when loops
got too long to fit in registers and spilled out, e.g., this might make
sense on the p690. The Xeon has only 8 registers, so we'll "always" spill.
But if we can do loops which maximize the amount of data that stays in
L1 cache, that can be a big help. Similarly, it's better to keep data in
L2 than have it spill into DRAM ...
Loop-unrolling has been used for fifteen or twenty years. Prominently, Dongarra was able to use the first Crays to full potential by developing vector operations as the BLAS-1 library. Now we need the trick just to get reasonable performance. These days, loop-unrolling is performed by many optimizing compilers, and it is often part of the compiler's default optimization. But the loops are typically unrolled in only one dimension. It's hard to choose the optimal loop-unrolling parameters automatically, and many times compilers get it wrong.
A priori, I tend to think that loops such as the above work best if they are more or less square, like the 4 by 4 example, but exhaustive searches over possible loops lead to long skinny blocks. (Clint Whaley's ATLAS BLAS chooses the "best" blocking, or nearly the best, automatically -- actually by timing some choices. There's also a Berkeley competitor, but in my experience it's less robust. Philosophically, the Berkeley BLAS works more from first principles for a given machine, while the ATLAS BLAS is empirical, timing various options and choosing the best.)
If you unroll by hand, don't forget to work out the "remainder" pieces carefully. It's also a good time to compare various levels of unrolling. As an example of when you might unroll by hand, consider matrix operations that don't quite fit the BLAS routines. For example, I need to produce the conjugate of a complex vector, and my PGI Fortran program will do only 20 million conjugates a second.
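For instance, a hand-unrolled version of such a conjugate loop might look like the sketch below (the routine name and the unroll depth of 4 are my choices; whether it beats the plain loop depends on the compiler):

c Sketch: w = conjg(z) for a complex*16 vector, unrolled by 4.
c Assumes n is a multiple of 4; otherwise add a remainder loop.
      subroutine conj4(n, z, w)
      implicit none
      integer n, i
      complex*16 z(n), w(n)
      do i = 1, n, 4
         w(i)   = conjg(z(i))
         w(i+1) = conjg(z(i+1))
         w(i+2) = conjg(z(i+2))
         w(i+3) = conjg(z(i+3))
      end do
      return
      end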
Aliasing is when data in a given memory location is referred to under
two different names. In C this is allowed. In Fortran, compilers are
allowed to assume this does not occur. Traditionally, this has meant
that Fortran compilers have more information and hence can perform
optimizations that would not be allowable for a C compiler.
So in Fortran, if a is an array, then typically the call
> call myfoo(a,a,4)
is not allowed, because the array a would be referred to under two different names inside
> subroutine myfoo(a,b,n)
The non-aliasing rule is not enforced in Fortran, but users who use aliasing may find their code is not portable.
Because aliasing is not allowed, the Fortran compiler writer has a little better chance of producing efficient code.
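Here's a small made-up illustration. In the sketch below a Fortran compiler may keep b(1) in a register for the whole loop; if the caller aliases the arguments, say call scaleadd(x,x,n), that assumption is wrong and the result can differ from a load-b(1)-every-time interpretation.

c Sketch: why the no-aliasing assumption helps the optimizer.
c If called as   call scaleadd(x, x, n)   then a and b are the same
c storage, a(1) changes on the first trip through the loop, and a
c compiler that keeps b(1) in a register computes something different.
      subroutine scaleadd(a, b, n)
      implicit none
      integer n, i
      real*8 a(n), b(n)
      do i = 1, n
         a(i) = b(i) + 2.d0*b(1)
      end do
      return
      end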
That being said, C is close to the hardware, and the ATLAS BLAS, for example, is written in C.
There are some nice operations in Fortran 90. For example, you can add two vectors with a single array statement, as shown further on.
Also, Fortran 90 and C++ allow operator overloading, which can make code writing easier. But as a performance matter, temporary space may be allocated at run time. That's a time-consuming task, perhaps equivalent to an interrupt.
System interrupts, e.g., for floating point exceptions, may be known to be harmless. For example, underflow may just mean zero. It's possible to test for NaNs (not-a-numbers) and take another branch if these occur.
When I was trying out the "perfect" code of a few sections further on, I found that I could run with NaNs at about 2 million (meaningless) flops per second. But putting in an "if" to test for each NaN would be slower.
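One common way to write the test in Fortran (a sketch, not the course code) uses the IEEE fact that a NaN is the only value not equal to itself; note that aggressive optimization flags can defeat this trick:

c Sketch: branch around NaNs using the x .ne. x test, which is
c true only when x is a NaN under IEEE arithmetic.
      real*8 function cleanval(x)
      implicit none
      real*8 x
      if (x .ne. x) then
c x is a NaN; substitute something harmless instead
         cleanval = 0.d0
      else
         cleanval = x
      endif
      return
      end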
Modern chips cope better with branches than they used to. But even running straight ahead they are usually achieving only 5 or 10 percent of peak, so having lots of branches "ready on spec" only means you'll continue along at 5% (which is of course better than the alternative of just stalling while everything for the step after the branch is fetched).
Does the branch prediction paradigm have ramifications for the C (or Perl) style of putting many subroutine calls inside an "if" error handler?
Type conversions cost more than flops.
Conjugations or changing sign cost more than flops?
Complex arithmetic is good in that it has twice as many flops per byte of memory accessed.
Special function evaluations are usually costly. Two approaches to saving time: use interpolation, or find a fast but not-quite-as-accurate library (or combine the two).
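As a sketch of the interpolation idea (the function, interval, and table size here are arbitrary examples, not from the text): tabulate exp on [0,1] once, then replace each library call with a lookup and a linear interpolation.

c Sketch: tabulate exp at m+1 points on [0,1], then approximate
c exp(x) for 0 <= x <= 1 by linear interpolation in the table.
c The table size m sets the accuracy; this trades accuracy for speed.
      subroutine mkexptab(m, tab)
      implicit none
      integer m, k
      real*8 tab(0:m)
      do k = 0, m
         tab(k) = exp(dble(k)/dble(m))
      end do
      return
      end

      real*8 function fastexp(x, m, tab)
      implicit none
      integer m, k
      real*8 x, tab(0:m), s
      s = x*dble(m)
      k = int(s)
      if (k .ge. m) k = m - 1
      fastexp = tab(k) + (s - dble(k))*(tab(k+1) - tab(k))
      return
      end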
There is an overhead for calling a subroutine. Goedecker and Hoisie claim this is about the same as the loop overhead for incrementing a loop, about ten cycles on the machine they tested. Many compilers have flags to in-line designated subroutines. An additional hazard of subroutines and functions in C and Fortran 90 is that many variables may have local scope, so a private copy is made for the local function. The advantage is possibly code safety, but it does take time. On some Fortran 90 compilers, subarray notation causes a copy to be made when the subarray is passed to a subroutine.
There are 3 different BLAS packages on the blades that can achieve
higher than 5 Gflops. The advertised peak speed is 6 Gflops.
The following program imitates one given in Goedecker and Hoisie
and they were able to get peak speed with it on an IBM Power architecture.
So let's try it on the BladeCenter.
I do timings by first getting an interactive shell:
>bsub -Is -W 5 csh
I can then just type the name of the executable. The bsub command gives me a processor of my own to play with, giving more faith that I'm getting the whole processor. Also, the burst of time I take doesn't delay anyone else on the head node. If the name of the executable file is peak,
then I can run it by typing
>./peak
The program should run at peak speed? It has lots of good paired matrix vector multiplies. There are 10 variables, so it doesn't quite fit in the 8 registers. But no statement depends on the one before it, so surely the compiler and chip should be able to do a couple of flops per clock cycle?
Of course, one needs lots of repeats to get above the one-hundredth-of-a-second resolution of the Linux clock function. And the repeats at first caused x1, x2, etc. to overflow the floating point range, so the initial rate was 2 Mflops (of meaningless flops). So I put in the rescaling to pull the numbers back toward one every so often. Then the rate was about 1100 Mflops for the Intel ifc compiler. The pgf90 compiler was actually faster, at 1300 Mflops.
So let's simplify the code and then see what the assembler looks like. Here's a simplified version. Compiling with
>pgf77 -S peak3.f
gives the assembler language code produced by the PGI f77 compiler. Conclusion: it's going to be tough to get these processors to run faster than about 1200 Mflops (20% of the advertised peak).
So when we look at code efficiency
we may often be satisfied if we can get 600 Mflops (10% of peak)
per processor.
In comparison, I saw an estimate from a German professor of scientific computation that the average code at a high performance computing site runs at 1% of peak efficiency.
Rework your matrix vector multiplies. Try loop unrolling in both the row and column directions and see if you can get nearer to the expected in-cache performance of 1200 Mflops. Also, can you find good compiler optimizations?
Alternatively, you could explain how to make the "ideal" code run at the advertised rate.
Here's a reference from Intel:
Speeding up Matrix Vector Multiplication
INTEGER COUNT, DATATYPE, OP, COMM, IERROR
IN sendbuf -- choose a data type
OUT recvbuf -- typically choose the same data type as in the sendbuf
IN count integer number of elements in send buffer
IN datatype MPI data type of elements of send buffer
IN op operation -- adding, getting max, etc.
IN comm communicator -- for example MPI_COMM_WORLD
OUT ierror integer, zero if successful
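For instance, a minimal Fortran sketch of the call might look like this (the program and variable names are mine; each rank contributes one double precision number and every rank receives the sum):

      program allred
c Minimal sketch of MPI_ALLREDUCE: every rank contributes its own
c rank number; afterwards every rank holds the sum over all ranks.
      implicit none
      include 'mpif.h'
      integer ierr, rank
      real*8 mypart, total
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      mypart = dble(rank)
      call MPI_ALLREDUCE(mypart, total, 1, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
      print *, 'rank', rank, ' total =', total
      call MPI_FINALIZE(ierr)
      end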
Suppose the machine can do
>a = b+c*d
in one clock cycle. Then the sum
>a = b + c*d + e*f
can be rewritten in two lines as
>a = b + c*d
>a = a + e*f
which could be 2 clock cycles. But if the original sum is switched to be written as
>a = e*f + b + c*d
then it's pretty likely that the compiler will execute it as
>a = e*f
>a = a + b
>a = a + c*d
program getsum
C This is a driver program to test Mflop rate for a sum
implicit none
integer lda, n, i, j,k, idum
parameter(lda=2000)
real*8 a(lda,lda),x(lda)
real*8 temp,mflops,sum ,tot ,t
real*4 ftime, pretim, entim
external ftime
print*, 'input n'
read*, n
c Initialize matrix a
do j = 1,n
do i = 1,n
a(i,j) = 1.d0/(i+2*j-1)
end do
end do
pretim = ftime()
c
c Write your matrix vector multiply here
c
do k = 1, 200
do j = 1,n
sum = 0.d0
do i=1,n
t = a(i,j)
c sum = sum + abs( a(i,j) )
sum = sum + t
end do
x(j) = sum
tot = x(j) + sum
end do
end do
entim = ftime() - pretim
print* , 'sum =', sum
print*,' entim =', entim
if (entim.lt.1.e-6) then
print*,' too fast to time'
else
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
mflops = k* (n)**2/entim/1.d6
print*,' Mflop rate of', mflops
endif
stop
end
This ran at around 100 Mflops. (One problem: the timer kept returning very small nonzero numbers, so I put in a less-than test.) So, imitating the suggestion of Goedecker and Hoisie, let's "unroll" the loop.
program getsum
C This is a driver program to test Mflop rate for a sum
implicit none
integer lda, n, i, j,k, idum
parameter(lda=2000)
real*8 a(lda,lda),x(lda)
real*8 temp,mflops,sum ,tot ,t ,sum1,sum2,sum3
real*4 ftime, pretim, entim
external ftime
print*, 'input n'
read*, n
c Initialize matrix a
do j = 1,n
do i = 1,n
a(i,j) = 1.d0/(i+2*j-1)
end do
end do
pretim = ftime()
c
c Write your matrix vector multiply here
c
tot = 0.d0
do k = 1, 200
do j = 1,n
sum = 0.d0
sum1 = 0.d0
sum2 = 0.d0
sum3 = 0.d0
do i=1,n,4
sum = sum + ( a(i,j) )
sum1 = sum1 + ( a(i+1,j) )
sum2 = sum2 + ( a(i+2,j) )
sum3 = sum3 + ( a(i+3,j) )
end do
x(j) = sum1+sum2+sum3+sum
tot = x(j) + tot
end do
end do
entim = ftime() - pretim
print* , 'sum =', sum
print*,' entim =', entim
if (entim.lt.1.e-6) then
print*,' too fast to time'
else
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
mflops = k* (n)**2/entim/1.d6
print*,' Mflop rate of', mflops
endif
stop
end
These ran typically at 200 Mflops, leading one to believe that the Xeon can handle two independent streams. Some surprises: with the computational line changed to
>sum = sum + a(i,j)*a(i,j)
one might expect that an add and a multiply could then be performed on each step. The flop rate for the ifc compiler was then about 20% faster (1200 to 1500 Mflops). This is faster than the rate at which the data could be sucked from RAM, but still not nearly as fast as the advertised 6 Gflops.
sum = sum + a(i,j) * a(i,j)
sum1 = sum1 + a(i+1,j) * a(i+1,j)
sum2 = sum2 + a(i+2,j) * a(i+2,j)
sum3 = sum3 + a(i+3,j) * a(i+3,j)
where presumably the machine is actually executing
a0 = a(i,j)
sum = sum + a0 * a0
a1 = a(i+1,j)
sum1 = sum1 + a1 * a1
a2 = a(i+2,j)
sum2 = sum2 + a2 * a2
a3 = a(i+3,j)
sum3 = sum3 + a3 * a3
This is in the case that the matrix A is small enough, say 80 by 80, which at 80*80*8 = 51200 bytes fits easily within cache memory. So apparently the machine is not able to do flops on each clock cycle. Since there are no "stores" needed, with sum, sum1, sum2, sum3 fitting in registers, it must be that the loads of a(i,j) are not occurring fast enough, or that the load operation has a delay before the multiply-add can occur.
a2 = a(i+2,j)
sum = sum + a0 * a0
a3 = a(i+3,j)
sum1 = sum1 + a1 * a1
a0 = a(i+4,j)
sum2 = sum2 + a2 * a2
a1 = a(i+5,j)
sum3 = sum3 + a3 * a3
where this overlaps successive loop iterations to avoid the dependency (making the first and last trips through the loop a bit more complicated).
do i=1,n
do j=1,m
b(i) = b(i) + a(i,j) * x(j)
end do
end do
then on each step of the inner loop we need a new value of x(j), which, since x is stored sequentially, has probably already been fetched (when x(j-1) was fetched). But we are accessing a by rows, so a(i,j) is probably not around. Also we're having to reload b(i). For an in-cache calculation, we would be likely to have b(i) in L1 cache and a(i,j) in L2 cache, so fetching b(i) is not so bad. Switching the order of the loops,
do j=1,n
do i=1,m
b(i) = b(i) + a(i,j) * x(j)
end do
end do
then on each step of the inner loop we need both a new value of a(i,j) and a new value of b(i). We already have x(j) in a register. Fortunately, a(i,j) and b(i) were probably fetched on the previous step. But still, to go through one inner loop, we have to load an entire column of a and also the entire vector b; i.e., b has to be reloaded n times (presumably only from L1 cache, or, for a large enough problem, from L2 cache).
do j=1,n,4
do i=1,m,4
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1) + a(i,j+2)*x(j+2)
+ a(i,j+3)*x(j+3)
b(i+1) = b(i+1) + a(i+1,j)*x(j)+a(i+1,j+1)*x(j+1)+ a(i+1,j+2)*x(j+2)
+ a(i+1,j+3)*x(j+3)
b(i+2) = b(i+2) + a(i+2,j)*x(j)+a(i+2,j+1)*x(j+1)+ a(i+2,j+2)*x(j+2)
+ a(i+2,j+3)*x(j+3)
b(i+3) = b(i+3) + a(i+3,j)*x(j)+a(i+3,j+1)*x(j+1)+ a(i+3,j+2)*x(j+2)
+ a(i+3,j+3)*x(j+3)
end do
end do
c And of course we need some clean-up steps when m, n are not
c multiples of 4
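Here is one hedged sketch of that cleanup (my code, not from the text), wrapped as a complete routine so the loop bounds are explicit; the unrolled part runs only over full 4-by-4 blocks, and two extra loops pick up the leftover rows and columns.

c Sketch: 4-by-4 unrolled matvec b = b + A*x with explicit cleanup
c for leftover rows and columns when m or n is not a multiple of 4.
c lda is the declared leading dimension of a, with lda >= m.
      subroutine mv44(m, n, lda, a, x, b)
      implicit none
      integer m, n, lda, i, j, mm, nn
      real*8 a(lda,n), x(n), b(m)
      mm = 4*(m/4)
      nn = 4*(n/4)
c main blocked part, as in the text, over full 4-by-4 blocks only
      do j = 1, nn, 4
         do i = 1, mm, 4
            b(i)   = b(i)   + a(i,j)*x(j)     + a(i,j+1)*x(j+1)
     &                      + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3)
            b(i+1) = b(i+1) + a(i+1,j)*x(j)     + a(i+1,j+1)*x(j+1)
     &                      + a(i+1,j+2)*x(j+2) + a(i+1,j+3)*x(j+3)
            b(i+2) = b(i+2) + a(i+2,j)*x(j)     + a(i+2,j+1)*x(j+1)
     &                      + a(i+2,j+2)*x(j+2) + a(i+2,j+3)*x(j+3)
            b(i+3) = b(i+3) + a(i+3,j)*x(j)     + a(i+3,j+1)*x(j+1)
     &                      + a(i+3,j+2)*x(j+2) + a(i+3,j+3)*x(j+3)
         end do
c cleanup: leftover rows i = mm+1,...,m for these four columns
         do i = mm+1, m
            b(i) = b(i) + a(i,j)*x(j)     + a(i,j+1)*x(j+1)
     &                  + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3)
         end do
      end do
c cleanup: leftover columns j = nn+1,...,n over all the rows
      do j = nn+1, n
         do i = 1, m
            b(i) = b(i) + a(i,j)*x(j)
         end do
      end do
      return
      end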
> c = a(1:4) + b(1:4)
which is much nicer for the programmer than writing a loop. But when it comes time to optimize, you probably convert most of these f90 array operations to BLAS calls (at least for frequently used portions of the code), especially for operations repeated many times on short vectors or small matrices.
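For instance (a sketch, assuming a Level-1 BLAS library is linked in), the vector addition above could be handed to dcopy and daxpy:

c Sketch: c = a + b for vectors of length n using Level-1 BLAS.
c dcopy copies a into c; daxpy then adds 1.0*b to c in place.
      subroutine vadd(n, a, b, c)
      implicit none
      integer n
      real*8 a(n), b(n), c(n)
      call dcopy(n, a, 1, c, 1)
      call daxpy(n, 1.d0, b, 1, c, 1)
      return
      end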
program peak
C This is a "peak" performance try.
implicit none
integer lda, n, i, j,k, idum
parameter(lda=2000)
real*8 x1,x2,x3,x4,y1,y2,y3,y4,s,t, mflops, mean
real*4 ftime, pretim, entim
external ftime
print*, 'input n'
read*, n
c Initialize the scalar variables
x1 = 1.d-1
x2 = 5.d-1
x3 = 4.d-1
x4 = 1.d-1
y1 = 9.d-1
y2 = 2.d0
y3 = 1.d0
y4 = 2.d0
s = -.125d-4
t = .125d-4
pretim = ftime()
c
c Write the fast flops here
c (lots of fused adds and multiplies, no new data,
c no dependencies)
c But on the Xeon, even this will spill the registers.
c
do j=1,1000*n
do i=1,n
y1 = x1 + s*x2 + t*x3
y2 = x2 + s*x1 + t*x4
y3 = x3 + s*x4 + t*x1
y4 = x4 + s*x3 + t*x2
x1 = y1 + s*y2 + t*y3
x2 = y2 + s*y1 + t*y4
x3 = y3 + t*y1 + s*y4
x4 = y4 + t*y2 + s*y3
end do
c Some extra stuff to make sure we don't overflow
mean = abs(x1)+abs(x2)+abs(x3)+abs(x4)+
+ abs(y1)+abs(y2)+abs(y3)+abs(y4)
mean = 1.d0/mean
x1 = x1*mean
x2 = x2*mean
x3 = x3*mean
x4 = x4*mean
y1 = y1*mean
y2 = y2*mean
y3 = y3*mean
y4 = y4 *mean
end do
entim = ftime() - pretim
print* , 'x1 =', x1
print*,' entim =', entim
if (entim.lt.1.e-4) then
print*,' too fast to time'
else
c
c The formula below counts 32 flops for each pass through the
c inner loop (1000*n*n passes in all) plus roughly 16 flops for
c each of the 1000*n rescaling steps.
c
mflops = (32.d0*(1.d3*n)*n + 16*(n*1.d3) )/entim/1.d6
print*,' Mflop rate of', mflops
endif
stop
end
program peak
C This is a "peak" performance try.
implicit none
integer lda, n, i, j,k, idum
c parameter(lda=100)
real*8 x1,x2,x3,x4,y1,y2,y3,y4,s,t, mflops, mean
c Initialize the scalar variables
n = 100
x1 = 1.d-1
x2 = 5.d-1
x3 = 4.d-1
x4 = 1.d-1
y1 = 9.d-1
y2 = 2.d0
y3 = 1.d0
y4 = 2.d0
s = -.125d-4
t = .125d-4
c
c Write the fast flops here
c (lots of fused adds and multiplies, no new data,
c no dependencies)
c But on the Xeon, even this will spill the registers.
c
do j=1,1000*n
do i=1,n
y1 = x1 + s*x2 + t*x3
y2 = x2 + s*x1 + t*x4
y3 = x3 + s*x4 + t*x1
y4 = x4 + s*x3 + t*x2
x1 = y1 + s*y2 + t*y3
x2 = y2 + s*y1 + t*y4
x3 = y3 + t*y1 + s*y4
x4 = y4 + t*y2 + s*y3
end do
c (the overflow rescaling was removed in this simplified version)
end do
c
c (the timing and the Mflop formula were also removed from this
c  simplified version)
c
stop
end
.file "peak3.f"
.version "01.01"
# PGFTN 5.0 -opt 1 -norecursive
# PGFTN 08/26/2004 12:54:24
# pgf77 peak3.f -S
# /usr/local/pgi/linux86/5.0/bin/pgftn
# peak3.f -opt 1 -terse 1 -inform warn -x 119 0xa10000 -x 119 0x100000 -x 122 0x40
# -x 123 0x1000 -x 127 4 -x 119 0x40000000 -x 80 0x300 -y 80 0x1000 -x 80 0x40000000
# -x 51 0x20 -x 124 0x1401 -x 80 0x300 -y 80 0x1000 -x 80 0x40000000 -astype 0
# -stdinc /usr/local/pgi/linux86/5.0/include:/usr/local/include:/usr/lib/gcc-lib/i386-redhat-linux/2.96/include:/usr/lib/gcc-lib/i386-redhat-linux/2.96/include:/usr/include
# -def unix -def __unix -def __unix__ -def linux -def __linux -def __linux__
# -def __inline__= -def i386 -def __i386 -def __i386__ -def __NO_MATH_INLINES
# -def linux86 -def __THROW= -cmdline +pgf77 peak3.f -S -x 124 1 -asm peak3.s
# lineno: 1
.text
.align 4
.long ..EN1_296-MAIN_+0xc8000000
.align 16
.globl MAIN_
MAIN_:
pushl %ebp
movl %esp,%ebp
subl $32,%esp
movl %ebx,12(%esp)
movl %edi,8(%esp)
..EN1_296:
# lineno: 0
movl $100,%eax
movl %eax,.BSS1
fldl .C1_316
fstl .BSS1+8
fldl .C1_295
fstpl .BSS1+16
fldl .C1_317
fstpl .BSS1+24
fstpl .BSS1+32
fldl .C1_318
fstpl .BSS1+40
fldl .C1_294
fstl .BSS1+48
fld1
fstpl .BSS1+56
fstpl .BSS1+64
fldl .C1_320
fstpl .BSS1+72
fldl .C1_319
fstpl .BSS1+80
movl $100000,%edi
movl $1,%edx
movl %edx,.BSS1+88
testl %edi,%edi
jle .LB1_324
.LB1_322:
# lineno: 30
movl .BSS1,%ebx
movl $1,%eax
movl %eax,.BSS1+92
testl %ebx,%ebx
jle .LB1_327
.LB1_325:
# lineno: 31
fldl .BSS1+24
fldl .BSS1+80
fld %st(1)
fmul %st(1),%st
fldl .BSS1+16
fldl .BSS1+72
fld %st(1)
fmul %st(1),%st
fldl .BSS1+8
fstl -8(%ebp)
fadd %st,%st(1)
fxch %st(1)
faddp %st,%st(4)
fxch %st(3)
fstl .BSS1+40
fldl .BSS1+32
fstl -16(%ebp)
fld %st(5)
fmul %st(1),%st
fxch %st(5)
fmul %st(3),%st
fadd %st(4),%st
faddp %st,%st(5)
fxch %st(4)
fstl .BSS1+48
fldl -8(%ebp)
fmul %st(6),%st
fxch %st(5)
fmul %st(3),%st
fadd %st(7),%st
faddp %st,%st(5)
fxch %st(4)
fstl .BSS1+56
fxch %st(3)
fmul %st(5),%st
fxch %st(6)
fmul %st(2),%st
fldl -16(%ebp)
faddp %st,%st(1)
faddp %st,%st(6)
fxch %st(5)
fstl .BSS1+64
fld %st(4)
fmul %st(3),%st
fld %st(2)
fmul %st(5),%st
fadd %st(7),%st
faddp %st,%st(1)
fstpl .BSS1+8
fld %st(4)
fmul %st(1),%st
fld %st(2)
fmul %st(7),%st
fadd %st(5),%st
faddp %st,%st(1)
fstpl .BSS1+16
fld %st(1)
fmul %st(1),%st
fxch %st(6)
fmul %st(5),%st
fadd %st(3),%st
faddp %st,%st(6)
fxch %st(5)
fstpl .BSS1+24
fmulp %st,%st(1)
fxch %st(1)
fmulp %st,%st(2)
fxch %st(1)
faddp %st,%st(2)
faddp %st,%st(1)
fstpl .BSS1+32
movl .BSS1+92,%eax
addl $1,%eax
movl %eax,.BSS1+92
subl $1,%ebx
jg .LB1_325
.LB1_327:
# lineno: 41
movl .BSS1+88,%eax
addl $1,%eax
movl %eax,.BSS1+88
subl $1,%edi
jg .LB1_322
.LB1_324:
# lineno: 43
xorl %eax,%eax
movl %eax,4(%esp)
movl %eax,(%esp)
call ftn_stop
# lineno: 52
movl -20(%ebp),%ebx
movl -24(%ebp),%edi
movl %ebp,%esp
popl %ebp
ret
.type MAIN_,@function
.size MAIN_,.-MAIN_
.data
.align 8
.C1_319:
.long 0xeb1c432d, 0x3eea36e2 # 1.25000000000000006E-5
.C1_320:
.long 0xeb1c432d, 0xbeea36e2 # -1.25000000000000006E-5
.C1_294:
.long 0x0, 0x40000000 # 2.00000000000000000E+0
.C1_318:
.long 0xcccccccd, 0x3feccccc # 9.00000000000000022E-1
.C1_317:
.long 0x9999999a, 0x3fd99999 # 4.00000000000000022E-1
.C1_295:
.long 0x0, 0x3fe00000 # 5.00000000000000000E-1
.C1_316:
.long 0x9999999a, 0x3fb99999 # 1.00000000000000006E-1
# STATIC VARIABLES
.data
.local .BSS1
.comm .BSS1,96,8
# COMMON BLOCKS
Looking at the assembler, we see a correspondence to the Fortran code line numbers. There are 65 assembler code lines corresponding to the 32 flops inside the block. So if there is a flop every other clock cycle, it looks as if one assembler instruction is being executed every clock cycle. Corresponding to the 16 floating point adds are 6 fadd instructions and 10 faddp. To the 16 floating point multiplies correspond 14 fmul and 2 fmulp instructions.
Here's a reference to Intel assembler. So far my code rearrangements have not helped.
Pentium Assembler