groups:bes3:users:howell:farm:howell-t1 [School of Physics and Astronomy Wiki]

howell-T1 farm

Configuration

(lines starting with $ are what I entered to recover this information)

Created by:

  $ mc list howell-T1 -c
  mc new "howell-T1" --seed 696189625 --decay ecm3770 --number 4.915113e8 --actions simulate dst d-skim monitor --drop-actions simulate --output-directory /hdfs/bes3/users/howell/simulations/howell-T1

Farm environment:

$ grep export /data/bes3d2/bes3/users/howell/simulations/howell-T1/work/process
export WorkArea="/data/bes3d2/bes3/users/howell/simulations/howell-T1/work/boss-workarea"
export BesFarmArea=/data/bes3d1/bes3/farm
export McSimulationsDirectory=/data/bes3d2/bes3/users/howell/simulations/
export McTemporaryOutputDirectory=/dev/shm
export McOutputDirectory=/hdfs/bes3/users/howell/simulations
export McStagingDirectory=/dev/shm/staging-howell
export McTemplatesDirectory=/data/bes3d1/bes3/farm/templates
export McDecaysDirectory=/data/bes3d1/howell/farm/decays

Log

2011-05-27

The problem appears to be fixed in commit bb2f314196b3758a6ffb971636e56b9daa8f321c. It was caused by temporary output files being kept in a directory separate from the staging directory: after a job finishes successfully, the staging directory gets cleaned up, but the temporary output directory did not. We now use only the staging directory, and commit 668445192aa5659db52200c4773a43311d11ccd4 adds some insurance that it gets cleaned up and that no data files get synced to the work directory.
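A minimal sketch of the "one staging directory, always cleaned up" idea, not the code from the commits above. The function name run_staged and the one-directory-per-job layout are assumptions for illustration; it also cleans up when the job fails, which this log does not confirm the real commit does.

```shell
# Hypothetical helper: run a job command inside its own staging directory
# and remove that directory afterwards, whether the command succeeded or not.
run_staged() {
    local staging_root="$1" job_name="$2" job_dir status=0
    # Refuse empty arguments so we never rm -rf the staging root itself.
    [ -n "$staging_root" ] && [ -n "$job_name" ] || return 2
    shift 2
    job_dir="$staging_root/$job_name"
    mkdir -p "$job_dir" || return 1
    # Run the job in a subshell so the caller's working directory is untouched.
    ( cd "$job_dir" && "$@" ) || status=$?
    # All temporary output lives under job_dir, so one rm clears everything.
    rm -rf "$job_dir"
    return "$status"
}
```

For example, `run_staged /dev/shm/staging-howell some-job ./some-command` would leave nothing behind in /dev/shm once the command returns; the job name and command here are placeholders.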
Also added commit d147245991fe89f8cdfb3e3846d98e5145bec371, which keeps completed jobs from re-running. A job would never re-run a completed stage in the first place, but a temporary stage (one whose output isn't kept) would be re-run in anticipation of non-temporary stages needing its output. This saves us a generation step when reconstruction, etc., are all finished.
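The decision described above can be sketched roughly as follows. This is an illustration, not the code in the commit: the "<stage>.end" marker-file convention, the function names, and treating every later stage as a consumer of the temporary output are all assumptions. The stage names come from the --actions and --drop-actions options in the Configuration section.

```shell
# Assumed layout: a finished stage leaves "<stage>.end" in the job directory.
STAGES="simulate dst d-skim monitor"   # order from the --actions list
TEMPORARY="simulate"                   # from --drop-actions

stage_done() { [ -e "$1/$2.end" ]; }

is_temporary() {
    case " $TEMPORARY " in *" $1 "*) return 0 ;; esac
    return 1
}

# Exit status 0 means "this stage should run".
stage_needed() {
    local job_dir="$1" stage="$2" s seen=0
    if ! is_temporary "$stage"; then
        # A kept stage runs only if it has not finished yet.
        ! stage_done "$job_dir" "$stage"
        return
    fi
    # A temporary stage runs only if some later stage is still unfinished.
    for s in $STAGES; do
        if [ "$seen" -eq 1 ] && ! stage_done "$job_dir" "$s"; then
            return 0
        fi
        [ "$s" = "$stage" ] && seen=1
    done
    return 1
}
```

With dst, d-skim and monitor all marked done, `stage_needed <jobdir> simulate` fails, so the generation step is skipped.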

2011-05-23

even after sed script, farm still broken
why? cleared out /dev/shm by rebooting machines (it is not persisted)
decided to test manually

$ cd /data/bes3d2/bes3/users/howell/simulations/howell-T1/work
$ ./process 0
...
boss.exe /.../bhwide.boss.txt ...

this created bhwide.end (as could be seen from the rsync output)
presumably it succeeded?
checked /hdfs/bes3/…/mc, but howell-T1-0-…mn.rtraw wasn't there!
check job script to see if the command to move it was broken (remember we changed mv → cp)

$ cd /dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185
$ grep cp execute
    cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.dst" "/hdfs/bes3/users/howell/simulations/howell-T1/dst/howell-T1-0-R11517-E185.mn.dst"
    cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.root" "/hdfs/bes3/users/howell/simulations/howell-T1/root/howell-T1-0-R11517-E185.mn.root"
    cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.skim" "/hdfs/bes3/users/howell/simulations/howell-T1/skim/howell-T1-0-R11517-E185.mn.skim"

no cp for the rtraw file! clearly must regenerate executables for /all/ jobs, as we can't trust any of them now.
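To see how widespread the damage is before regenerating, something like the following could list every execute script that lacks a cp line for a given file type. It is a sketch: the function name missing_copy is made up, and it assumes one execute script per job directory with cp lines shaped like the ones shown above (double-quoted paths).

```shell
# List "execute" scripts under jobs_dir that have no cp line for files
# ending in .<suffix> (paths are assumed to be double-quoted, as above).
missing_copy() {
    local jobs_dir="$1" suffix="$2" script
    for script in "$jobs_dir"/*/execute; do
        [ -f "$script" ] || continue
        # Print the script path only when no matching cp line is found.
        grep -Eq "^[[:space:]]*cp .*\\.${suffix}\"" "$script" || printf '%s\n' "$script"
    done
}
```

Calling `missing_copy <jobs directory> rtraw` prints one path per affected job; empty output would mean every script copies its rtraw file.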

$ cd /data/bes3d2/bes3/users/howell/simulations/howell-T1/work/jobs
$ rm */execute

Now re-run job 0

before 2011-05-23

Previous cascading failures:
- move to hadoop (mv <file> /hdfs/bes3/…/<file>) of output files was not removing the original
- causes /dev/shm (our staging area) to fill up (only 24G in size)
- causes farm to die (no staging space)
So we can't trust mv to hadoop; instead use cp to hadoop and then rm -f the file (rm -f /definitely/ removes the file)
mc updated correctly, but we still have this old farm which has so much progress; a shame to waste it
construct a sed command to update the howell-T1 job scripts
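The sed command itself isn't recorded here. A sketch of what such a rewrite could look like, assuming each move sits on its own line as mv "SRC" "DST" with both paths double-quoted (the shape the cp lines in the execute scripts have); the function name mv_to_cp_rm is made up for illustration:

```shell
# Rewrite   mv "SRC" "DST"   as   cp "SRC" "DST" && rm -f "SRC"
# Lines that do not match the assumed shape pass through unchanged.
mv_to_cp_rm() {
    sed -E 's#^([[:space:]]*)mv ("[^"]*") ("[^"]*")[[:space:]]*$#\1cp \2 \3 \&\& rm -f \2#' "$@"
}
```

It reads stdin or the files given as arguments and writes to stdout, so the result can be inspected before any job script is overwritten. Chaining with && means the original is removed only if the copy succeeded, which is a choice made in this sketch rather than something the log specifies.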
