(lines starting with $ are what I entered to recover this information)
$ mc list howell-T1 -c
mc new "howell-T1" --seed 696189625 --decay ecm3770 --number 4.915113e8 --actions simulate dst d-skim monitor --drop-actions simulate --output-directory /hdfs/bes3/users/howell/simulations/howell-T1
$ grep export /data/bes3d2/bes3/users/howell/simulations/howell-T1/work/process
Problem (appears to be) fixed in commit bb2f314196b3758a6ffb971636e56b9daa8f321c; it was caused by temporary output files being kept outside the staging directory. When a job finishes successfully, the staging directory gets cleaned up, but the separate temporary output folder did not. We now use only the staging directory, and commit 668445192aa5659db52200c4773a43311d11ccd4 adds some insurance that it gets cleaned up and that no data files get synced to the work directory.
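Roughly, the staging-only approach with cleanup insurance amounts to something like this (an illustrative sketch, not the actual mc job-script code; the directory name is made up):

```shell
# Sketch: keep ALL temporary output inside one staging directory and
# register a trap so it is removed even if the job dies part-way.
# (Illustrative only; "myjob" is an invented name.)
staging="/dev/shm/staging-$USER/myjob-$$"
mkdir -p "$staging"
trap 'rm -rf "$staging"' EXIT
# ... every intermediate file is written under "$staging" only,
# so nothing can leak into the work directory or a stray temp folder ...
```

The trap fires on any shell exit, so a failed job no longer leaves files behind in /dev/shm.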
Also added commit d147245991fe89f8cdfb3e3846d98e5145bec371, which prevents fully completed jobs from re-running. Jobs would never re-run a completed stage in the first place, but temporary stages (whose output isn't kept) would be re-run in anticipation of non-temporary stages needing the temporary output, even when all non-temporary stages were already done. This saves a generation step when reconstruction, etc., are all finished.
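The skip can be sketched as a marker-file check (hypothetical; the real mc bookkeeping surely differs, and the "$stage.done" marker name is invented):

```shell
# Hypothetical sketch of skipping completed stages: before running a
# stage (including a temporary one), check whether its completion
# marker already exists. Marker-file scheme is invented for illustration.
run_stage() {
    stage="$1"
    if [ -e "$stage.done" ]; then
        echo "skipping $stage (already complete)"
        return 0
    fi
    # ... run the stage's actual commands here ...
    touch "$stage.done"
}
```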
even after sed script, farm still broken
why? first cleared out /dev/shm by rebooting the machines (/dev/shm is a tmpfs, so its contents do not persist across reboots)
decided to test manually
$ cd /data/bes3d2/bes3/users/howell/simulations/howell-T1/work
$ ./process 0
boss.exe /.../bhwide.boss.txt ...
this created bhwide.end (as could be seen from the rsync output)
presumably it succeeded?
checked /hdfs/bes3/…/mc, but howell-T1-0-…mn.rtraw wasn't there!
check job script to see if the command to move it was broken (remember we changed mv → cp)
$ cd /dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185
$ grep cp execute
cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.dst" "/hdfs/bes3/users/howell/simulations/howell-T1/dst/howell-T1-0-R11517-E185.mn.dst"
cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.root" "/hdfs/bes3/users/howell/simulations/howell-T1/root/howell-T1-0-R11517-E185.mn.root"
cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.skim" "/hdfs/bes3/users/howell/simulations/howell-T1/skim/howell-T1-0-R11517-E185.mn.skim"
$ cd /data/bes3d2/bes3/users/howell/simulations/howell-T1/work/jobs
$ rm */execute
Previous cascading failures:
the move of output files to hadoop (mv <file> /hdfs/bes3/…/<file>) was not removing the original
which causes /dev/shm (our staging area, only 24G in size) to fill up
which causes the farm to die (no staging space left)
So we can't trust mv to hadoop; instead use cp to hadoop and then rm -f the file (rm -f /definitely/ removes the file)
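The replacement pattern, sketched as a tiny helper (the helper name and paths are illustrative, not the actual job-script code):

```shell
# Sketch of the cp-then-rm replacement for the untrustworthy mv:
# only delete the staged original after the copy to hadoop succeeds.
# (safe_move is an invented name for illustration.)
safe_move() {
    cp "$1" "$2" && rm -f "$1"
}
```

Because of the `&&`, a failed copy leaves the staged original in place instead of silently losing data.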
mc was updated correctly, but we still have this old farm with so much progress already made; a shame to waste it
construct a sed command to update the howell-T1 job scripts
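something along these lines should work (the exact form of the mv lines in the execute scripts is an assumption here, so the pattern may need tweaking; demonstrated on a sample script):

```shell
# Demo of the rewrite on a sample job script. Assumes the execute
# scripts contain lines of the exact form: mv "<src>" "<dst>"
tmp=$(mktemp -d); mkdir "$tmp/job0"
printf 'mv "/dev/shm/staging/x.dst" "/hdfs/bes3/x.dst"\n' > "$tmp/job0/execute"
cd "$tmp"
# the actual rewrite: quoted mv -> cp && rm -f, in place, for every job
sed -i 's|^mv \("[^"]*"\) \("[^"]*"\)$|cp \1 \2 \&\& rm -f \1|' */execute
cat job0/execute
# prints: cp "/dev/shm/staging/x.dst" "/hdfs/bes3/x.dst" && rm -f "/dev/shm/staging/x.dst"
```

Run against the real `work/jobs` directory, the `*/execute` glob would hit every job script at once.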