Mantis Bugtracker

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0000854 [Squeak] VM crash random 01-30-05 22:13 10-02-05 18:37
Reporter lewis View Status public  
Assigned To
Priority normal Resolution fixed  
Status closed   Product Version 3.9
Summary 0000854: Long Delay schedules a deferred image crash
Description If you schedule a Delay with a duration long enough to be a LargeInteger, it will sit happily in the SuspendedDelays queue until such time as it becomes the next Delay to be activated. At that time, #primitiveSignalAtMilliseconds is called. The primitive is only able to handle SmallIntegers, so it fails, whereupon #primSignal:atMilliseconds: attempts to throw 'a primitive has failed', which bolluxes up the works in the process scheduler. As far as I can tell, this is an unrecoverable error condition.

The following illustrates the problem and will usually crash your image:

  | bigDelay |
  bigDelay := Delay forSeconds: SmallInteger maxVal + 1.
  [bigDelay wait.
    Transcript show: 'I have been waiting a long time for this'; cr] fork.
  (Delay forSeconds: 1) wait.
  Smalltalk garbageCollect "=> image will hang after failed primitive"

Two things need to be addressed:

1) The primitive should accept large integer arguments. Note that the longest possible delay with a small integer is only about 12 days:
     Duration seconds: (SmallInteger maxVal / 1000) asInteger => 12:10:15:41

2) Delay class>>primSignal:atMilliseconds: should do something other than #primitiveFailed when the primitive fails. Perhaps writing something to the Transcript would make sense, or just ignore the failure entirely?
Additional Information This bug exists in all versions of Squeak.
Attached Files  noLongDelays-ls.1.cs [^] (1,323 bytes) 01-30-05 22:58
 DelayCrashFix-dtl.cs.gz [^] (1,067 bytes) 02-01-05 06:15
 LongDelay-dtl.cs.gz [^] (3,228 bytes) 02-03-05 01:45
 LongDelayTest-dtl.cs.gz [^] (1,101 bytes) 02-03-05 01:45
 CheckInitialDelayTime-854-v4-laza.2.cs [^] (3,766 bytes) 02-12-05 12:52

- Relationships

SYSTEM WARNING: Creating default object from empty value

SYSTEM WARNING: Creating default object from empty value

related to 0001644closed  [ENH] DelayRefactoring-st 
related to 0001840closed ducasse Long Delays and multiple delays with same resumption time 
related to 0006834closed laza Delay class>>timeoutSemaphore:afterMSecs: doesn't. 

- Notes
(0001113 - 551 - 575 - 575 - 575 - 575 - 575)
01-30-05 22:57

Asking the primitive to support long integers seems like a bit much to me. The better solution is make a LongDelay class which can handle huge delays. There would have to be a background process periodically checking for LongDelay's that are close to coming into effect, and schedule a regular Delay if so.

That's a lot of work. How about, for now, we have class Delay simply refuse to create long-waiting instances? It's a shame not to support long-running delays, but this would seem much better than the current situation.

Patch attached.
(0001114 - 649 - 673 - 673 - 673 - 673 - 673)
01-30-05 23:12

Correction: My suggestion of having the primitive accept LargeInteger objects is a bad idea. The VM is keeping track of milliseconds in an int variable, so at best we could increase the maximum duration of a Delay by a factor of 4, which does not fix anything really.

(Lex - funny, I was just thinking about the idea of LongDelay as soon as I realized that using LargeIntegers was a dumb idea. Maybe it's not so hard to do, it could just be something that holds on to a collection of regular Delays until they all expire.)

Having class Delay refuse to create instances with LargeInteger durations does sound like the right thing to do for now.
(0001115 - 251 - 269 - 269 - 269 - 269 - 269)
01-30-05 23:28

Lex, your fix won't catch too long delay times in the instance creation method forSeconds: in Delay.
In my variation of your fix I removed some redundant code and placed the bounds check on the instance side. Hopefully this catches all cases.

(0001117 - 508 - 520 - 520 - 520 - 520 - 520)
01-31-05 02:14

We should probably check for a LPI after a prim fail and do the appropriate modulo arithmetic to find out if the delay should fire in the next cycle round the ~12days of milliseconds.

Consider using an actual Date as a way to specify the firing time of a really long delay; it would at least make it very explicit that this is a Delay of immense proportions. Debugging would be a bit less hair-tearing if you could see at a glance that the Delay is meant to go off in a thousand years and not yesterday.
(0001118 - 828 - 987 - 987 - 987 - 987 - 987)
01-31-05 12:44

Oops, the proposed bounds check fixes will not work. The #primitiveSignalAtMilliseconds is called with the millisecond clock time at which we want the Delay to end. This means that a Delay of SmallInteger>>maxVal milliseconds will call the primitive with a LPI argument, which will trigger the system hang problem.

Note that this is also going to be platform dependent. If there are any platforms that do not roll the millisecond clock every 24 hours, they will behave differently with respect to the longest Delay that can be represented.

  | bigDelay |
  bigDelay := Delay forMilliseconds: SmallInteger maxVal.
  [bigDelay wait.
    Transcript show: 'I have been waiting a long time for this'; cr] fork.
  (Delay forSeconds: 1) wait.
  Smalltalk garbageCollect "=> image will hang after failed primitive"
(0001119 - 710 - 914 - 914 - 914 - 914 - 914)
01-31-05 18:43

    "Return the value of the millisecond clock as an integer. Note that the millisecond clock wraps around periodically. On some platforms it can wrap daily. The range is limited to SmallInteger maxVal / 2 to allow delays of up to that length without overflowing a SmallInteger."

Ok, so in V2 I lowered the upper range check down to SmallInteger maxVal // 2.

This seems to work:

  | bigDelay |
  bigDelay := Delay forMilliseconds: SmallInteger maxVal // 2 - 1.
  [bigDelay wait.
    Transcript show: 'I have been waiting a long time for this'; cr] fork.
  (Delay forSeconds: 1) wait.
  Smalltalk garbageCollect "=> image will hang after failed primitive"
(0001120 - 1068 - 1298 - 1298 - 1392 - 1392 - 1392)
02-01-05 06:18

Posted on squeak-dev, with attachment uploaded here:

Subject: [BUG][FIX] Long Delay crashes Morphic image
If an attempt to schedule a Delay fails, cancel the Delay and schedule and error notification (Morphic only). This corrects a problem in which an invalid Delay results in a locked image when running in Morphic.
An invalid Delay can be created by specifying a delay duration that overflows the size of a SmallInteger. For example, the following will crash a Morphic image if this change set has not been applied:
  | bigDelay |
  bigDelay := Delay forMilliseconds: SmallInteger maxVal + 1.
  bigDelay inspect.
  [bigDelay wait.
    Transcript show: 'I have been waiting a long time for this'; cr] fork.
  (Delay forSeconds: 1) wait.
  Smalltalk garbageCollect
This change set corrects the system crash problem reported as 0000854 on Mantis (Subject: [BUG][VM] long Delay schedules a deferred image crash).
It does not resolve the remaining problem that Delays of duration longer than about 12 days cannot be scheduled on Squeak.
(0001121 - 1211 - 1409 - 1409 - 1409 - 1409 - 1409)
02-01-05 09:20

I see some problems with your fix.

1) Error reporting is defered until the Delay is scheduled as the active delay

  | bigDelay |
    someDelay := Delay forMilliseconds: 24*60*60*1000.
    bigDelay := Delay forMilliseconds: SmallInteger maxVal + 1.
    [someDelay wait. Transcript show: 'Waited some time'; cr] fork.
    [bigDelay wait. Transcript show: 'I have been waiting a long time for this'; cr] fork.

With your fix the error will be reported earliest after a day, when someDelay gets unscheduled.

2) It makes the behaviour of Delay non deterministic, because delay times t > (SmallInteger maxVal // 2) will sometimes work and sometimes not, depending on the MillisecondClock value c and t+c<=Smalltalk maxVal

I think class Delay should be seen as to be used only for rather short delays, because it is based on a clock with millisecond precision. For anything that should be delayed days or weeks I think we should have some LongDelay class, wich could be based on a coarser clock and cooperates with Delay.

Now to be on the safe side of things, should we lower the upper bound for Delays even more down to 24h, because the Millisecond clock on some platforms wraps every 24h?
(0001122 - 526 - 560 - 560 - 560 - 560 - 560)
02-01-05 12:39

Right, DelayCrashFix-dtl is only intended to address the "image hangs" part of the problem. It does not address the problem of large Delays not working properly, and you are right that this is non-deterministic behavior.

IMO it is not acceptable to have a failure mode that hangs the system, so both problems should be fixed.

With respect to the millisecond clock question, I'm not exactly sure how this works on the various platorms and VMs. Your choice of SmallInteger//2 is probably safe, but I'm really not certain.
(0001123 - 103 - 103 - 103 - 103 - 103 - 103)
02-01-05 13:28

I'm not sure either, so I asked for some additional advice from some of the authors of the Delay class.
(0001127 - 429 - 488 - 488 - 488 - 488 - 488)
02-02-05 16:58

via email:

Hi -

Short version: Lex' solution is by far the best. Making the primitive LargeInt aware might be impossible; but one could easily think of having a LongDelay using (short) delays to re-schedule itself when appropriate (in fact, it would be doable to build this into the timer interrupt watcher itself).

In the absence of that fix, raising an error is by far preferrable to a VM crash.

 - Andreas
(0001128 - 329 - 354 - 354 - 354 - 354 - 354)
02-02-05 17:06

> Right, DelayCrashFix-dtl is only intended to address the "image hangs" part of the problem.

Ok, so I build a final changeset (V3) which combines the two approaches (bound check and crash prevention) and limited the longest possible delay time to 24h. Anything above that should be handled by the upcoming LongDelay class! :)
(0001130 - 1053 - 1098 - 1098 - 1098 - 1098 - 1098)
02-03-05 00:00

Guys, I think you're getting a bit off track here.

There is no need for any extra delay class; a Delay just needs a number and an LPI is fine. Floats and Fraction (and Complex and Tensors etc) are probably a poor idea for now. LPIs sort perfectly well in the delay list. No problems there.

What we need is to stop trying to schedule any Delays with non-SmallInt resumption times. Notice that we already have code to re-work the resumtption times when the clock rolls over (see Delay class>timerInterruptWatcher) so long delays will eventually work their way down to 'sensible' values.

The problem is in #activate IF the lowest value resumption time is >SmallInteger. At this point the thing to do is to create a substitute Delay with some longish delay and a new semaphore (for it to signal) All that will do is put off the evil moment for a while but during that time other Delays may be made and life should continue as normal. If the long delay gets to the front of the queue again and is still >SMallInteger then te same thing gets done.
(0001132 - 824 - 848 - 848 - 848 - 848 - 848)
02-03-05 01:56

I posted an implementation of LongDelay on the list, and have attached the change sets here also. This seems to do a fairly good job of implementing long delays. I tried to fail-safe it a bit by giving the delay watcher process a nice name so folks will not be tempted to terminate it in the ProcessBrowser.

Tim may have a better idea there, need to think about it. But I'm not sure how to do it without adding a Process to watch the delayTheInevitable semaphore. Once you have added a process to wait on the inevitable, I think you have about the same thing that I just did with LongDelay.

Notes: Class Monitor is subclassed from Delay, hence any image crash bug does not go away when LongDelay is added. The DelayCrashFix-dtl patch is still required as a safety net, regardless of how we end up handling long delays.
(0001143 - 1125 - 1179 - 1179 - 1179 - 1179 - 1179)
02-12-05 12:51

(back from carnival :)


I think we should fix the bug by guarding the primitive against wrong parameters and leave the Delay class as it is. For the majority of use cases the max delay time of about six days should be more than enough. Besides Dave's recent demand for extra long delays, I don't think I have seen any complains in the last seven years about the max Delay being too short. Extending the Delay class to support extra long delays by using substitute delays dosn't sound too good to me. Such a change would also mean a lot of testing (work) to get Delay rock solid and I don't think this is worth the benefits. (And I don't want to be the one who gets tared and feathered if the extended Delay later turns out to break in any way :)
I would say the few that *really* need extra long delays can just easily use Dave's changeset (or use a solution of their own).

That said, I provide just another fix, which isn't that restrictive. I've tested that even if I use a artificial crippled millisecondClock that wraps about every 8 seconds, delay times from 0 to (SmallInt // 2) do indeed work.

(0001144 - 101 - 101 - 101 - 101 - 101 - 101)
02-12-05 12:56

I think the issue has been resolved. Just feel free to change the status again if you don't think so.
(0001719 - 266 - 266 - 266 - 266 - 266 - 266)
07-08-05 18:23

I'm not sure I would count this issue to be resolved as long as it has not been 'fixed' in any existing image or update. Perhaps this appeared in the now abandoned 3.9a branch? At any rate this does not appear to be fixed in 3.8-6665 and so I'm going to reopen it.
(0002738 - 303 - 309 - 309 - 309 - 309 - 309)
09-30-05 09:53

putting this on resolved is a good idea: We need to be able to tag those items that are resolved, but not yet harvested. This makes the life of the harvesters *much* easier, because they can focus on that stuff that can be looked at. So: if fixed by a changeset --> resolved. If in the image --> closed.
(0002739 - 284 - 284 - 284 - 284 - 284 - 284)
09-30-05 18:00

Yes, OK. I did that back before I'd really considered how we would apply the statuses for our purposes. I agree that I should have simply left it as Resolved and that is the correct status for a report that has an attached (or referenced as being in a public 'inbox') fix that works.
(0002758 - 16 - 16 - 16 - 16 - 16 - 16)
10-02-05 18:36

in 3.9a script 5

- Issue History
Date Modified Username Field Change
01-30-05 22:13 lewis New Issue
01-30-05 22:57 lexspoon Note Added: 0001113
01-30-05 22:58 lexspoon File Added: noLongDelays-ls.1.cs
01-30-05 23:12 lewis Note Added: 0001114
01-30-05 23:28 laza Note Added: 0001115
01-30-05 23:28 laza File Added: CheckInitialDelayTime-854-v1-laza.1.cs
01-31-05 02:14 tim Note Added: 0001117
01-31-05 12:44 lewis Note Added: 0001118
01-31-05 18:43 laza Note Added: 0001119
01-31-05 18:44 laza File Added: CheckInitialDelayTime-854-v2-laza.1.cs
02-01-05 06:15 lewis File Added: DelayCrashFix-dtl.cs.gz
02-01-05 06:18 lewis Note Added: 0001120
02-01-05 09:20 laza Note Added: 0001121
02-01-05 12:39 lewis Note Added: 0001122
02-01-05 13:28 laza Note Added: 0001123
02-02-05 16:58 laza Note Added: 0001127
02-02-05 17:06 laza Note Added: 0001128
02-02-05 17:06 laza File Added: CheckInitialDelayTime-854-v3-laza.1.cs
02-02-05 17:08 laza File Deleted: CheckInitialDelayTime-854-v1-laza.1.cs
02-02-05 17:08 laza File Deleted: CheckInitialDelayTime-854-v2-laza.1.cs
02-03-05 00:00 tim Note Added: 0001130
02-03-05 01:45 lewis File Added: LongDelay-dtl.cs.gz
02-03-05 01:45 lewis File Added: LongDelayTest-dtl.cs.gz
02-03-05 01:56 lewis Note Added: 0001132
02-12-05 12:51 laza Note Added: 0001143
02-12-05 12:52 laza File Added: CheckInitialDelayTime-854-v4-laza.2.cs
02-12-05 12:52 laza File Deleted: CheckInitialDelayTime-854-v3-laza.1.cs
02-12-05 12:56 laza Status new => resolved
02-12-05 12:56 laza Resolution open => fixed
02-12-05 12:56 laza Assigned To  => laza
02-12-05 12:56 laza Note Added: 0001144
02-12-05 12:56 laza Assigned To laza =>
02-12-05 12:58 laza Issue Monitored: laza
07-08-05 18:23 KenCausey Status resolved => feedback
07-08-05 18:23 KenCausey Resolution fixed => reopened
07-08-05 18:23 KenCausey Note Added: 0001719
09-16-05 01:33 laza Relationship added related to 0001644
09-30-05 09:53 MarcusDenker Status feedback => resolved
09-30-05 09:53 MarcusDenker Resolution reopened => fixed
09-30-05 09:53 MarcusDenker Assigned To  => MarcusDenker
09-30-05 09:53 MarcusDenker Note Added: 0002738
09-30-05 09:54 MarcusDenker Assigned To MarcusDenker =>
09-30-05 18:00 KenCausey Note Added: 0002739
10-02-05 18:36 ducasse Status resolved => closed
10-02-05 18:36 ducasse Note Added: 0002758
10-05-05 01:33 tim Relationship added related to 0001840
01-05-08 07:21 laza Relationship added related to 0006834
01-05-08 14:51 lewis Issue Monitored: lewis

Mantis 1.0.8[^]
Copyright © 2000 - 2007 Mantis Group
163 total queries executed.
79 unique queries executed.
Powered by Mantis Bugtracker