The solver of the Perfect Club's finite difference code FDMOD.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      subroutine fd3d(it,nt,ldim,del,dt,p1,p2,src)
c
      real p1(ldim,ldim,ldim), p2(ldim,ldim,ldim), src(nt)

      do k=2,ldim-1
        do j=2,ldim-1
          do i=2,ldim-1
            p1(i,j,k) = (2.0 - 6.0*del)*p2(i,j,k) - p1(i,j,k) +
     +           (p2(i+1,j,k) + p2(i-1,j,k) +
     +            p2(i,j+1,k) + p2(i,j-1,k) +
     +            p2(i,j,k+1) + p2(i,j,k-1) ) * del
          enddo
        enddo
      enddo
c
c     Add source
c
      if ( it .lt. nt) then
        p1(ldim/2,ldim/2,ldim/2) = p1(ldim/2,ldim/2,ldim/2) + src(it)*dt
      endif
c
      return
      end
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As you'll note, the "locality" is 2, so the parallelism is much less than
it appears.

Here is "hand" coding with PCF directives and APP routines to control the
bus loading and parallelism.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Notes:
  1) mcpreg.h is a 'standard' header file w/ FORTRAN parameter
     statements related to the i860 cache and register sizes.
  2) MCP_SREG is the i860 cache; the OS never touches it - the program
     controls it explicitly.
  3) _svload = single precision vector load into register
  4) Take careful note of the 'endpoint' saving.
  5) I have unrolled the y loop to increase locality but it's a small
     gain.

Timings:
  Optimized scalar/vector
    APP/701   Elapsed time : 66.09400 sec
  Optimized scalar/vector and _fd3d_ KAP (automatic) parallelized
    APP/707   Elapsed time : 25.61100 sec
  Optimized scalar/vector and _fd3d_ KAP (automatic) parallelized
    APP/714   Elapsed time : 30.93000 sec
  Optimized scalar/vector and _fd3d_ pfp (hand) parallelized
    APP/714   Elapsed time : 11.48200 sec

Note the low number of PEs on the APP. All seven buses are used because
of the high vector nature (except for case #1 with one PE).

      subroutine fd3d(it,nt,ldim,del,dt,p1,p2,src)
c
CRS
      INCLUDE 'mcpreg.h'
CRS  Make 6 vector "registers" from the cache less placement for 3
CRS  scalar variables.
      INTEGER VL
      PARAMETER (VL=(((MCP_SREG_SIZE-3)/6)/2)*2)
      REAL*4 V1(VL),V2(VL),V3(VL),V4(VL),V5(VL),V6(VL)
CRS  the 6 vectors V1 thru V6
      EQUIVALENCE (V1, MCP_SREG(1) )
      EQUIVALENCE (V2, MCP_SREG(1+VL) )
      EQUIVALENCE (V3, MCP_SREG(1+VL*2) )
      EQUIVALENCE (V4, MCP_SREG(1+VL*3) )
      EQUIVALENCE (V5, MCP_SREG(1+VL*4) )
      EQUIVALENCE (V6, MCP_SREG(1+VL*5) )
CRS  3 scalar variables to be held in cache
      EQUIVALENCE (ENDPOINT, MCP_SREG(1+VL*6+1) )
      EQUIVALENCE (DELV, MCP_SREG(1+VL*6+2) )
      EQUIVALENCE (DELFACTOR, MCP_SREG(1+VL*6+3) )

      real p1(ldim,ldim,ldim), p2(ldim,ldim,ldim), src(nt)

CPCF PARALLEL MAXPARALLEL 12*MCP_NBUS()
CRS  Move outside of loop structure and bus control sections
      DELV = DEL
      DELFACTOR = 2.0 - 6.0*DELV
CRS
CPCF PRIVATE k,j,i,ii,MS
CPCF PDO
CPCF CRITICAL SECTION BUS
      do k=2,ldim-1
        do j=2,ldim-1
          do i=1,ldim,VL-2
CRS  Strip mine in "working increments" which are 2 less than the
CRS  vector length
            MS = MIN0 (LDIM+1-I,VL)
            CALL _SVLOAD(P1(i,j,k)  ,1,V1,1,MS)
            CALL _SVLOAD(P2(i,j,k)  ,1,V2,1,MS)
            CALL _SVLOAD(P2(i,j-1,k),1,V3,1,MS)
            CALL _SVLOAD(P2(i,j+1,k),1,V4,1,MS)
            CALL _SVLOAD(P2(i,j,k-1),1,V5,1,MS)
            CALL _SVLOAD(P2(i,j,k+1),1,V6,1,MS)
CPCF END CRITICAL SECTION
CRS  Variable ENDPOINT has no value in the first "i" pass, see next
CRS  comment
            IF ( i .GT. 1) V1(1) = ENDPOINT
*****                              ^^^^^
CRS  Need to "save" the start point of next vector segment before it's
CRS  overwritten by the do ii loop.
            ENDPOINT = V1(MS-1)
            DO ii = 2, MS-1
              V1(ii-1) = (DELFACTOR)*V2(ii) - V1(ii) +
     +             (V2(ii+1)+V2(ii-1)+
     +              V3(ii)+V4(ii)+
     +              V5(ii)+V6(ii)
     +             )*DEL
            ENDDO
CPCF CRITICAL SECTION BUS
            CALL _SVSTOR(V1,1,P1(i+1,j,k),1,MS-2)
          enddo
        enddo
      enddo
CPCF END CRITICAL SECTION
CPCF END PARALLEL
c
c     Add source
c
      if ( it .lt. nt) then
        p1(ldim/2,ldim/2,ldim/2) = p1(ldim/2,ldim/2,ldim/2) + src(it)*dt
      endif
c
      return
      end
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Good Luck,
=Pat
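
For anyone without APP hardware, the strip-mining idea in the hand-coded fd3d (load up to VL points, update the interior ones, step by VL-2) can be tried in plain Fortran. What follows is only a sketch under stated assumptions, not the APP code. It is a 1-D reduction of the stencil, so the factor becomes 2.0 - 2.0*del. Plain array copies stand in for _SVLOAD and _SVSTOR. The sizes n and vl, the tolerance, and the test data are made up for illustration. Nothing here touches MCP_SREG or the PCF directives.

```fortran
      program stripdemo
      implicit none
      ! Hypothetical sizes for illustration only; the real VL is
      ! derived from mcpreg.h, which is not reproduced here.
      integer, parameter :: n = 23, vl = 8
      real :: p1(n), p2(n), ref(n), v1(vl), v2(vl)
      real :: del, fac, err
      integer :: i, ii, ms

      del = 0.1
      ! 1-D analogue of (2.0 - 6.0*del): two neighbours, not six.
      fac = 2.0 - 2.0*del
      do i = 1, n
         p1(i) = real(i)
         p2(i) = real(i*i)
      end do

      ! Reference: straightforward loop over interior points 2..n-1.
      ref = p1
      do i = 2, n-1
         ref(i) = fac*p2(i) - ref(i) + (p2(i+1) + p2(i-1))*del
      end do

      ! Strip-mined version: each strip loads ms points and updates
      ! the ms-2 interior ones, so strips advance by vl-2.
      do i = 1, n, vl-2
         ms = min(n+1-i, vl)
         ! Array copies stand in for the vector loads.
         v1(1:ms) = p1(i:i+ms-1)
         v2(1:ms) = p2(i:i+ms-1)
         do ii = 2, ms-1
            v1(ii-1) = fac*v2(ii) - v1(ii) + (v2(ii+1) + v2(ii-1))*del
         end do
         ! Stand-in for the vector store; empty when ms is under 3.
         p1(i+1:i+ms-2) = v1(1:ms-2)
      end do

      ! Compare with a small tolerance (an arbitrary choice).
      err = maxval(abs(p1 - ref))
      if (err .le. 1.0e-3) then
         print *, 'strip-mined result matches reference'
      else
         print *, 'mismatch, max error =', err
      end if
      end program stripdemo
```

The program compares the strip-mined update against the straightforward loop and reports whether they agree. It is written to be acceptable as either fixed-form or free-form source, so it should build with any Fortran 90 compiler.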