To-Do List

  • code update: to-do list (12/18)
  • get ratio figures online (12/18)
  • add description to ratio figs (12/18)
  • notable progress on 'Results' (12/19)
  • finish ModYY description (12/20)

Monday

02.00p-05.30p working on boundary error bug

         - no progress

Thursday

06.00a-12.00p working on boundary error bug

         - added checkpoints for array sizes

12.00p-01.30p adding bug explanation to website

04.00p-06.00p working on boundary error bug

         - boundaries are now communicating, still errors

Friday

08.00a-12.00p working on boundary error bug

         - likely just a typo/math error at this point

12.00p-01.30p updating bug explanation on site

03.00p-04.00p code cleanup

04.00p-04.30p adding work log (this) to website

05.00p-06.00p finishing bug writeup

06.00p-07.00p finishing work log

         - added auto-formatting, column auto-resizing

08.00p-01.00a finishing code cleanup

         - ran into the original mpi_alltoallv issue again

Saturday

10.00a-01.00p more code cleanup

         - finished, sent to John

11/16-11/20

Boundary Implementation

Data Sharing Across Processors

Data sharing is broken into three steps: the "Loading" step, the "Distributing" step, and the "Receiving" step.

To prepare for data sharing, the code first runs a setup routine for each boundary, where it calculates the following numbers:

  • nZonesLoad - the number of zones an individual processor owns that are needed for the boundary calculation on the OTHER grid.
  • nZonesDist - the number of zones an individual processor will receive from its paired processor(s), which then need to be sent on to PEs on the processor's own grid.
  • nZonesRecv - the number of zones an individual processor should receive from its distributing processor.

and the following arrays (a declaration sketch of these quantities follows this list):

  • loadCount - counts, for each (sending PE, receiving PE) pair, how many zones will be loaded; it is used later to verify the setup.
  • distCount - counts, for each (sending PE, receiving PE) pair, how many zones will be distributed; it determines how the setup arrays (and later the boundary data) are sent.
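
For orientation, the sketch below shows how this bookkeeping might be declared and initialized. It is illustrative only: the module and subroutine names are invented here, PE indices are assumed to start at 0 as in the loops below, and both count arrays are sized by the number of z-columns per grid (npez = 8 in the runs shown later).


   module boundary_setup_sketch           ! hypothetical module, names illustrative
      implicit none
      ! per-PE zone totals computed during setup
      integer :: nZonesLoad    ! zones this PE loads for the other grid
      integer :: nZonesDist    ! zones this PE passes on to PEs on its own grid
      integer :: nZonesRecv    ! zones this PE receives from its distributing PE
      ! (sending PE, receiving PE) zone counts
      integer, allocatable :: loadCount(:,:)
      integer, allocatable :: distCount(:,:)
   contains
      subroutine init_boundary_setup(npez)
         integer, intent(in) :: npez       ! z-columns per grid (8 in the runs below)
         allocate(loadCount(0:npez-1, 0:npez-1))
         allocate(distCount(0:npez-1, 0:npez-1))
         nZonesLoad = 0;  nZonesDist = 0;  nZonesRecv = 0
         loadCount  = 0;  distCount  = 0
      end subroutine init_boundary_setup
   end module boundary_setup_sketch
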

The Loading Step

To set up the loading step, each processor loops through all ghost zones on its opposing grid and records which local zones it owns that are used to calculate those ghost zones.

The total number of zones is recorded in nZonesLoad, while the j- and k-values of each zone are recorded in the 1D arrays jSendLoc and kSendLoc, respectively. The loop over all ghost zones is set up so that the data stays organized by (other-grid) processor as it is loaded into these arrays, i.e.,

               _______________________________________________________
              |             |             |             |             |
              | j-vals  for | j-vals  for | j-vals  for | j-vals  for |
   jSendLoc = | pe(y/z) = 0 | pe(y/z) = 1 | pe(y/z) = 2 | pe(y/z) = n |
              | ghost zones | ghost zones | ghost zones | ghost zones |
              |_____________|_____________|_____________|_____________|

The structure of the loading loop over all ghost zones is as follows:


   nZonesLoadY = 0
   loadCount = 0
   do k = 1, kmaxYY
      myRecvPE = (k-1)/ksYY      // pe on OTHER grid
      gphi = zzcYY(k)
      jjmin = jnminYY(k) - 6
      jjmax = jnmaxYY(k) - 6
      do j = 1, 12
         if (j < 7) then
            gthe = zycYY(jjmin) - zdyYY(jjmin)*j
         else
            gthe = zycYY(jjmax) + zdyYY(jjmax)*(j-6)
         end if

         // (gthe,gphi) are on OTHER grid - find coords on THIS grid
         nthe = acos (sin(gthe) *sin(gphi))
         nphi = atan2(cos(gthe),-sin(gthe)*cos(gphi))

         n = 2
         do while (nthe > zyc(n))
            n = n + 1
         end do
         m = 2
         do while (nphi > zzc(m))
            m = m + 1
         end do

         // [... 'loading step' code ...]

      end do
   end do
   

where the integers n and m correspond to the j- and k-indices, on the local grid, of each ghost zone from the other grid. The following code is used to record these values.


   // [... 'loading step' code ...]
   
   // for each zone [(n-0,m-0), (n-0,m-1), (n-1,m-0), (n-1,m-1)]
   // if a PE owns the ghost zone's data, record what it needs to send
   
   // 1. (n-0,m-0)
   mySendPE = (m-1)/ks     // pe on THIS grid
   loadCount(mySendPE,myRecvPE) = loadCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesLoadY = nZonesLoadY + 1
      jSendLocY(nZonesLoadY) = n-0
      kSendLocY(nZonesLoadY) = m-0
   end if
   
   // 2. (n-0,m-1)
   mySendPE = (m-2)/ks
   loadCount(mySendPE,myRecvPE) = loadCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesLoadY = nZonesLoadY + 1
      jSendLocY(nZonesLoadY) = n-0
      kSendLocY(nZonesLoadY) = m-1
   end if
   
   // 3. (n-1,m-0)
   mySendPE = (m-1)/ks
   loadCount(mySendPE,myRecvPE) = loadCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesLoadY = nZonesLoadY + 1
      jSendLocY(nZonesLoadY) = n-1
      kSendLocY(nZonesLoadY) = m-0
   end if
   
   // 4. (n-1,m-1)
   mySendPE = (m-2)/ks
   loadCount(mySendPE,myRecvPE) = loadCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesLoadY = nZonesLoadY + 1
      jSendLocY(nZonesLoadY) = n-1
      kSendLocY(nZonesLoadY) = m-1
   end if
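

The two trigonometric lines in the middle of this loop map a (theta,phi) point from one grid into the coordinates of the other. The same formula appears again in the distributing-step loop with the roles of the grids reversed, which works because the mapping is its own inverse. A small standalone check of that round trip (a hypothetical test driver, not part of the boundary code):


   program yinyang_check
      implicit none
      double precision :: the, phi, gthe, gphi, nthe, nphi
      the = 1.0d0          ! arbitrary test point, away from the poles
      phi = 0.5d0
      ! map to the other grid (same formula as in the setup loops)
      gthe = acos (sin(the) *sin(phi))
      gphi = atan2(cos(the),-sin(the)*cos(phi))
      ! map back; this should recover (the,phi)
      nthe = acos (sin(gthe) *sin(gphi))
      nphi = atan2(cos(gthe),-sin(gthe)*cos(gphi))
      print *, 'original :', the,  phi
      print *, 'mapped   :', gthe, gphi
      print *, 'recovered:', nthe, nphi
   end program yinyang_check
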

After all ghost zones have been looped over, we also have the array loadCount, which is used only to check that the counts are correct once setup for the distributing step has completed. An example from each grid, using npey=6 and npez=8 in total (Yin: npey=4, npez=8; Yang: npey=2, npez=8):


    yin |    32    34    36    38    40    42    44    46 |
   -------------------------------------------------------------- 
      0 |     0     0     0    17    17     0     0     0 |    34
      4 |     0     0    54    43    43    54     0     0 |   194
      8 |    41    60     6     0     0     6    60    41 |   214
     12 |    19     0     0     0     0     0     0    19 |    38
     16 |    17     0     0     0     0     0     0    17 |    34
     20 |    43    54     0     0     0     0    54    43 |   194
     24 |     0     6    60    41    41    60     6     0 |   214
     28 |     0     0     0    19    19     0     0     0 |    38
   -------------------------------------------------------------- 
        |   120   120   120   120   120   120   120   120 |   960


   yang |     0     4     8    12    16    20    24    28 |
   -------------------------------------------------------------- 
     32 |     0     0    15   115   115    15     0     0 |   260
     34 |     0     0    99     5     5    99     0     0 |   208
     36 |     1    96     6     0     0     6    96     1 |   206
     38 |   119    24     0     0     0     0    24   119 |   286
     40 |   115    15     0     0     0     0    15   115 |   260
     42 |     5    99     0     0     0     0    99     5 |   208
     44 |     0     6    96     1     1    96     6     0 |   206
     46 |     0     0    24   119   119    24     0     0 |   286
   -------------------------------------------------------------- 
        |   240   240   240   240   240   240   240   240 |  1920

The first column lists each loading PE on jcol=0 of its grid. The column headers list the paired PEs on the opposite grid, and each entry gives how many zones are intended for that PE. The final column lists the total number of zones to be sent by each loading PE.
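
Because every PE executes the same loop over all of the other grid's ghost zones, each PE ends up with the complete loadCount array, and the row belonging to a PE must sum to the number of zones that PE recorded in jSendLocY/kSendLocY. A minimal sketch of one such sanity check (the subroutine name is invented here, and it would live in a module so the assumed-shape argument has an explicit interface):


   subroutine check_load_counts(loadCount, mypez, nZonesLoadY)
      implicit none
      integer, intent(in) :: mypez, nZonesLoadY
      integer, intent(in) :: loadCount(0:,0:)   ! (sending PE, receiving PE)
      ! this PE's row of loadCount should sum to the number of zones
      ! it recorded in jSendLocY/kSendLocY during the loading setup
      if (sum(loadCount(mypez,:)) /= nZonesLoadY) then
         print *, 'loadCount mismatch on pe', mypez, ':', &
                  sum(loadCount(mypez,:)), 'vs', nZonesLoadY
      end if
   end subroutine check_load_counts
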

The Distributing Step

Setup for the distributing step is structured the same as that of the loading step. However, each processor now loops through all ghost zones on its own grid and determines which PEs on the other grid own data for those ghost zones. If the PE that owns a zone is (one of) the processor's paired PE(s), the processor records that zone along with the j- and k-values of the ghost zone it is destined for. These values are recorded in 1D arrays structured the same as jSendLoc and kSendLoc. This structuring is important, as it is what allows data to be shared efficiently between the two paired processors.


   nZonesDistY = 0
   distCount = 0
   do k = 1, kmax
      myRecvPE = (k-1)/ksort     // pe on THIS grid
      nphi = zzc(k)
      jjmin = jnmin(k) - 6
      jjmax = jnmax(k) - 6
      do j = 1, 12
         if (j < 7) then
            nthe = zyc(jjmin) - zdy(jjmin)*j
         else
            nthe = zyc(jjmax) + zdy(jjmax)*(j-6)
         end if

         // (nthe,nphi) are on THIS grid - find coords on OTHER grid
         gthe = acos (sin(nthe) *sin(nphi))
         gphi = atan2(cos(nthe),-sin(nthe)*cos(nphi))
         
         n = 2
         do while (gthe > zycYY(n))
            n = n + 1
         end do
         m = 2
         do while (gphi > zzcYY(m))
            m = m + 1
         end do

         // [... 'distributing step' code ...]

      end do
   end do
   

where the distributing step code is


   // [... 'distributing step' code ...]

   // for each zone [(n-0,m-0), (n-0,m-1), (n-1,m-0), (n-1,m-1)]
   // if a PE will receive data from its paired processor, record
   // where it needs to send that data

   // 1. (n-0,m-0)
   mySendPE = (m-1)/ksYY   // pe on OTHER grid (== pe on THIS grid)
   distCount(mySendPE,myRecvPE) = distCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesDistY = nZonesDistY + 1
      nRecvLocYbuff(nZonesDistY) = 1
      jRecvLocYbuff(nZonesDistY) = j
      kRecvLocYbuff(nZonesDistY) = k
   end if

   // 2. (n-0,m-1)
   mySendPE = (m-2)/ksYY
   distCount(mySendPE,myRecvPE) = distCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesDistY = nZonesDistY + 1
      nRecvLocYbuff(nZonesDistY) = 2
      jRecvLocYbuff(nZonesDistY) = j
      kRecvLocYbuff(nZonesDistY) = k
   end if

   // 3. (n-1,m-0)
   mySendPE = (m-1)/ksYY
   distCount(mySendPE,myRecvPE) = distCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesDistY = nZonesDistY + 1
      nRecvLocYbuff(nZonesDistY) = 3
      jRecvLocYbuff(nZonesDistY) = j
      kRecvLocYbuff(nZonesDistY) = k
   end if

   // 4. (n-1,m-1)
   mySendPE = (m-2)/ksYY
   distCount(mySendPE,myRecvPE) = distCount(mySendPE,myRecvPE) + 1
   if (mySendPE == mypez) then
      nZonesDistY = nZonesDistY + 1
      nRecvLocYbuff(nZonesDistY) = 4
      jRecvLocYbuff(nZonesDistY) = j
      kRecvLocYbuff(nZonesDistY) = k
   end if

Once all ghost zones have been looped over, all of the data is organized. The final setup step is to distribute these arrays to each processor using the same method that will be used to distribute the actual boundary data (so that each processor knows in what order it will receive boundary data). To determine how the data is sent, we use the distCount array. An example for the same processor configuration as above gives:


    yin |     0     4     8    12    16    20    24    28 |
   -------------------------------------------------------------- 
      0 |     0     0    15   115   115    15     0     0 |   260
      4 |     0     0    99     5     5    99     0     0 |   208
      8 |     1    96     6     0     0     6    96     1 |   206
     12 |   119    24     0     0     0     0    24   119 |   286
     16 |   115    15     0     0     0     0    15   115 |   260
     20 |     5    99     0     0     0     0    99     5 |   208
     24 |     0     6    96     1     1    96     6     0 |   206
     28 |     0     0    24   119   119    24     0     0 |   286
   -------------------------------------------------------------- 
        |   240   240   240   240   240   240   240   240 |  1920
 

   yang |    32    34    36    38    40    42    44    46 |
   -------------------------------------------------------------- 
     32 |     0     0     0    17    17     0     0     0 |    34
     34 |     0     0    54    43    43    54     0     0 |   194
     36 |    41    60     6     0     0     6    60    41 |   214
     38 |    19     0     0     0     0     0     0    19 |    38
     40 |    17     0     0     0     0     0     0    17 |    34
     42 |    43    54     0     0     0     0    54    43 |   194
     44 |     0     6    60    41    41    60     6     0 |   214
     46 |     0     0     0    19    19     0     0     0 |    38
   -------------------------------------------------------------- 
        |   120   120   120   120   120   120   120   120 |   960

Note that these arrays are identical to those produced in the 'Loading' step, except that they are now displayed from the perspective of the opposite grid.
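
The per-PE counts collected in these arrays are what drive the actual exchange: for a given PE, the offsets into its send (or receive) buffer are simply running sums of the counts, the same pattern visible in the sendCount/sendOffset rows of the dumps further down. A minimal sketch of that conversion (the subroutine name is invented; counts and offsets are in elements and 0-based, as mpi_alltoallv expects):


   subroutine counts_to_offsets(counts, offsets, npe)
      implicit none
      integer, intent(in)  :: npe
      integer, intent(in)  :: counts(npe)     ! zones for each destination PE
      integer, intent(out) :: offsets(npe)    ! start of each PE's block (0-based)
      integer :: i
      offsets(1) = 0
      do i = 2, npe
         offsets(i) = offsets(i-1) + counts(i-1)
      end do
   end subroutine counts_to_offsets


For example, counts of 0 0 0 72 72 0 0 0 give offsets 0 0 0 0 72 144 144 144, matching the first sendCount/sendOffset pair shown in the dumps below.
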

Code Errors (outdated)


   mpirun -np 48 --bind-to-core ./vh1-inner
      STEP =          0
        dt =   0.000024
      time =   0.000000
   [tycho.physics.ncsu.edu:20459] *** An error occurred in MPI_Recv
   [tycho.physics.ncsu.edu:20459] *** on communicator MPI_COMM_WORLD
   [tycho.physics.ncsu.edu:20459] *** MPI_ERR_TRUNCATE: message truncated
   [tycho.physics.ncsu.edu:20459] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

There is an interesting issue with array overwrite during the mpi_alltoallv call on the final two processors.

   <type>  :: sendBuffer(*), recvBuffer(*)
   integer :: sendCount(*),  sendOffset(*)
   integer :: recvCount(*),  recvOffset(*)
   
   call mpi_alltoallv(
      sendBuffer,
      sendCount,
      sendOffset,
      mpi_integer,
      recvBuffer,
      recvCount,
      recvOffset,
      mpi_integer,
      mpi_communicator,
      mpierr
   )
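

MPI_ERR_TRUNCATE generally means that a posted receive was smaller than the incoming message; for mpi_alltoallv that happens when a rank's recvCount entry does not match what the corresponding rank actually sends. Below is a minimal self-contained example of the call pattern (not taken from VH-1; the send/receive sizes are invented for illustration):


   program alltoallv_demo
      use mpi
      implicit none
      integer :: mpierr, myrank, npes, i
      integer, allocatable :: sendCount(:), sendOffset(:)
      integer, allocatable :: recvCount(:), recvOffset(:)
      integer, allocatable :: sendBuffer(:), recvBuffer(:)

      call mpi_init(mpierr)
      call mpi_comm_rank(mpi_comm_world, myrank, mpierr)
      call mpi_comm_size(mpi_comm_world, npes,   mpierr)

      allocate(sendCount(npes), sendOffset(npes), recvCount(npes), recvOffset(npes))

      sendCount = myrank + 1           ! this rank sends (myrank+1) values to every rank
      do i = 1, npes
         recvCount(i) = i              ! and receives i values from rank (i-1)
      end do

      ! offsets are running sums of the counts (0-based, in elements)
      sendOffset(1) = 0
      recvOffset(1) = 0
      do i = 2, npes
         sendOffset(i) = sendOffset(i-1) + sendCount(i-1)
         recvOffset(i) = recvOffset(i-1) + recvCount(i-1)
      end do

      allocate(sendBuffer(sum(sendCount)), recvBuffer(sum(recvCount)))
      sendBuffer = myrank

      ! if any recvCount(i) were smaller than what rank (i-1) sends,
      ! the exchange could fail with a truncation error
      call mpi_alltoallv(sendBuffer, sendCount, sendOffset, mpi_integer, &
                         recvBuffer, recvCount, recvOffset, mpi_integer, &
                         mpi_comm_world, mpierr)

      if (myrank == 0) print *, 'rank 0 received:', recvBuffer

      call mpi_finalize(mpierr)
   end program alltoallv_demo
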

Send and Receive - Sizes


   pe |      mypey   mypez
   --------------------------------------------------------------------------------
   46 |        0       7

   alltoallv : sendCount, sendOffset -- recvCount, recvOffset
   pe |        0       1       2       3       4       5       6       7
   --------------------------------------------------------------------------------
   32 |        0       0       0      72      72       0       0       0 |     144 
   32 |        0       0       0       0      72     144     144     144 |
   --------------------------------------------------------------------------------
   32 |        1       2       3       4       1       2       3       4 |      20 
   32 |        0       0       0     168     240     312     480     480 |
   ================================================================================


   pe |      mypey   mypez
   --------------------------------------------------------------------------------
   47 |        1       7

   alltoallv : sendCount, sendOffset -- recvCount, recvOffset
   pe |        0       1       2       3       4       5       6       7
   --------------------------------------------------------------------------------
   32 |        0       0       0      72      72       0       0       0 |     144 
   32 |        0       0       0       0      72     144     144     144 |
   --------------------------------------------------------------------------------
   32 |        1       2       3       4       1       2       3       4 |      20 
   32 |        0       0       0     168     240     312     480     480 |
   ================================================================================

Per Processor Breakdown


   swapCountY from loading step (viewed as OTHER grid):

     |        0       1       2       3       4       5       6       7 |  
   --------------------------------------------------------------------------------- 
   0 |        0       0       0      72      72       0       0       0 |      144
   1 |        0      12     228     168     168     228      12       0 |      816
   2 |      168     228      12       0       0      12     228     168 |      816
   3 |       72       0       0       0       0       0       0      72 |      144
   4 |       72       0       0       0       0       0       0      72 |      144
   5 |      168     228      12       0       0      12     228     168 |      816
   6 |        0      12     228     168     168     228      12       0 |      816
   7 |        0       0       0      72      72       0       0       0 |      144
   --------------------------------------------------------------------------------- 
     |      480     480     480     480     480     480     480     480 |     3840


   swapCountY from distributing step (viewed as THIS grid):

     |        0       1       2       3       4       5       6       7 |  
   --------------------------------------------------------------------------------- 
   0 |        0       0      78     468     468      78       0       0 |     1092
   1 |        0      12     390      12      12     390      12       0 |      828
   2 |       12     390      12       0       0      12     390      12 |      828
   3 |      468      78       0       0       0       0      78     468 |     1092
   4 |      468      78       0       0       0       0      78     468 |     1092
   5 |       12     390      12       0       0      12     390      12 |      828
   6 |        0      12     390      12      12     390      12       0 |      828
   7 |        0       0      78     468     468      78       0       0 |     1092
   --------------------------------------------------------------------------------- 
     |      960     960     960     960     960     960     960     960 |     7680