groups:bes3:users:howell:farm:howell-t1 (created 2011/05/23 13:01; current revision 2011/05/27 11:57, by howell)
  
=== Log ===
== 2011-05-27 ==
  * Problem (appears to be) fixed in commit bb2f314196b3758a6ffb971636e56b9daa8f321c; it was caused by temporary output files being kept separate from the staging directory. After a job finishes, the staging directory gets cleaned up (if the job succeeded), but the temporary output folder doesn't. We now use only the staging directory, and commit 668445192aa5659db52200c4773a43311d11ccd4 adds some insurance that it gets cleaned up and that no data files get synced to the work directory.
  * Also added commit d147245991fe89f8cdfb3e3846d98e5145bec371, which keeps completed jobs from re-running. Jobs would never re-run a complete stage in the first place, but a temporary stage (whose output isn't kept) would be re-run in anticipation of non-temporary stages requiring the temporary output. This saves us a generation step when reconstruction, etc., are already finished.
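The fixed flow described above can be sketched roughly as follows. This is a minimal illustration, not the farm's real code: the `run_job` name, the `mktemp` stand-in for `/dev/shm/staging-<user>/<job>`, and the file names are all assumptions.

```shell
# Sketch of the fix, under assumed names: every output, temporary or not,
# is written inside the staging directory; wanted outputs are published
# with cp; the whole staging directory is then removed, success or failure,
# so temporary files can never pile up outside it.
run_job() {
    dest=$1
    staging=$(mktemp -d)                # stand-in for /dev/shm/staging-<user>/<job>
    ( cd "$staging" &&
      echo "simulated event data" > out.rtraw &&  # real output, to be published
      echo "scratch" > out.tmp &&                 # temporary output, never leaves staging
      cp out.rtraw "$dest" )                      # publish with cp, not mv
    status=$?
    rm -rf "$staging"                   # insurance: staging is always cleaned up
    return $status
}
```

Publishing with cp and then deleting the whole staging tree mirrors the two commits described above: one stops keeping temporary output outside staging, the other guarantees the cleanup.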
== 2011-05-23 ==
  * even after the sed script, the farm is still broken
  * why? we had cleared out /dev/shm by rebooting the machines (it is not persisted)
  * decided to test manually
  
  $ cd /data/bes3d2/bes3/users/howell/simulations/howell-T1/work
  boss.exe /.../bhwide.boss.txt ...
  
  * this created bhwide.end (as could be seen from the rsync output)
  * presumably it succeeded?
  * checked /hdfs/bes3/.../mc, but howell-T1-0-...mn.rtraw wasn't there!
  * check the job script to see whether the command to move it was broken (remember we changed mv -> cp)

  $ cd /dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185
  $ grep cp execute
      cp "/dev/shm/staging-howell/howell-T1/howell-T1-0-R11517-E185/howell-T1-0-R11517-E185.mn.skim" "/hdfs/bes3/users/howell/simulations/howell-T1/skim/howell-T1-0-R11517-E185.mn.skim"
  
  * no cp for the rtraw file! clearly we must regenerate the executables for /all/ jobs, as we can't trust any of them now.

  $ cd /data/bes3d2/bes3/users/howell/simulations/howell-T1/work/jobs
  $ rm */execute
  * Now re-run job 0
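After the executables are regenerated, a quick audit could confirm that every job's execute script now contains a cp for its .rtraw output. This is only a sketch: the `audit_jobs` helper and the `cp .*\.rtraw` pattern are assumptions inferred from the grep output above, not part of the farm's tooling.

```shell
# Hypothetical audit: given the jobs directory, print every job whose
# execute script still lacks a cp line for a .rtraw file.
audit_jobs() {
    for d in "$1"/*/; do
        grep -q 'cp .*\.rtraw' "$d/execute" 2>/dev/null \
            || echo "missing rtraw cp: $d"
    done
}
# e.g. audit_jobs /data/bes3d2/bes3/users/howell/simulations/howell-T1/work/jobs
```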
  
== before 2011-05-23 ==
  * Previous cascading failures:
    * the move to hadoop (mv <file> /hdfs/bes3/.../<file>) of output files was not removing the original
    * this caused /dev/shm (our staging area, only 24G in size) to fill up
    * which caused the farm to die (no staging space left)
  * So we can't trust mv to hadoop; instead, cp to hadoop and then rm -f the original (rm -f /definitely/ removes the file)
  * mc updated correctly, but we still have this old farm with so much progress; it would be a shame to waste it
  * construct a sed command to update the howell-T1 job scripts
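The cp-then-rm replacement for mv described above can be sketched as a small helper. The `safe_move` name is illustrative, and the actual sed command used to rewrite the job scripts isn't recorded in this log.

```shell
# Copy to the destination first, then explicitly remove the source.
# Unlike mv onto the hadoop mount, rm -f definitely removes the original,
# so /dev/shm can't silently fill up with leftover copies.
safe_move() {
    cp "$1" "$2" && rm -f "$1"
}
```

Because the `&&` short-circuits, a failed cp leaves the source in place, so nothing is lost if the hadoop side rejects the write.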
  