Postfix Queue Scheduler

-------------------------------------------------------------------------------

Disclaimer

Many of the transport-specific configuration parameters discussed in this
document will not show up in "postconf" command output before Postfix version
2.9. This limitation applies to many parameters whose name is a combination of
a master.cf service name such as "relay" and a built-in suffix such as
"_destination_concurrency_limit".

Overview

The queue manager is by far the most complex part of the Postfix mail system.
It schedules delivery of new mail, retries failed deliveries at specific
times, and removes mail from the queue after the last delivery attempt. There
are two major classes of mechanisms that control the operation of the queue
manager.

Topics covered by this document:

  * Concurrency scheduling, concerned with the number of concurrent deliveries
    to a specific destination, including decisions on when to suspend
    deliveries after persistent failures.
  * Preemptive scheduling, concerned with the selection of email messages and
    recipients for a given destination.
  * Credits, something this document would not be complete without.

Concurrency scheduling

The following sections document the Postfix 2.5 concurrency scheduler, after a
discussion of the limitations of the earlier concurrency scheduler. This is
followed by results of medium-concurrency experiments, and a discussion of
trade-offs between performance and robustness.

The material is organized as follows:

  * Drawbacks of the existing concurrency scheduler
  * Summary of the Postfix 2.5 concurrency feedback algorithm
  * Summary of the Postfix 2.5 "dead destination" detection algorithm
  * Pseudocode for the Postfix 2.5 concurrency scheduler
  * Results for delivery to concurrency limited servers
  * Discussion of concurrency limited server results
  * Limitations of less-than-1 per delivery feedback
  * Concurrency configuration parameters

Drawbacks of the existing concurrency scheduler

From the start, Postfix has used a simple but robust algorithm where the per-
destination delivery concurrency is decremented by 1 after delivery failed due
to connection or handshake failure, and incremented by 1 otherwise. Of course
the concurrency is never allowed to exceed the maximum per-destination
concurrency limit. And when a destination's concurrency level drops to zero,
the destination is declared "dead" and delivery is suspended.

Drawbacks of +/-1 concurrency feedback per delivery are:

  * Overshoot due to exponential delivery concurrency growth with each pseudo-
    cohort(*). This can be an issue with high-concurrency channels. For
    example, with the default initial concurrency of 5, concurrency would
    proceed over time as (5-10-20).

  * Throttling down to zero concurrency after a single pseudo-cohort(*)
    failure. This was especially an issue with low-concurrency channels where
    a single failure could be sufficient to mark a destination as "dead",
    causing the suspension of further deliveries to the affected destination.

(*) A pseudo-cohort is a number of delivery requests equal to a destination's
delivery concurrency.
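
To make these drawbacks concrete, here is a tiny Python sketch (illustrative
only, not Postfix code) that applies the old +1/-1 feedback one pseudo-cohort
at a time, starting from the default initial concurrency of 5 with a
destination concurrency limit of 20:

    # Old +/-1 scheduler, applied one pseudo-cohort at a time.
    def old_feedback(initial=5, limit=20, cohorts=("ok", "ok", "ok", "fail")):
        concurrency = initial
        for outcome in cohorts:
            if outcome == "ok":      # +1 per successful delivery in the cohort
                concurrency = min(limit, concurrency * 2)
            else:                    # -1 per failed delivery in the cohort
                concurrency = 0
            print(outcome, "->", concurrency)

    old_feedback()   # prints: ok -> 10, ok -> 20, ok -> 20, fail -> 0 ("dead")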

The revised concurrency scheduler has a highly modular structure. It uses
separate mechanisms for per-destination concurrency control and for "dead
destination" detection. The concurrency control in turn is built from two
separate mechanisms: it supports less-than-1 feedback per delivery to allow
for more gradual concurrency adjustments, and it uses feedback hysteresis to
suppress concurrency oscillations. And instead of waiting for delivery
concurrency to throttle down to zero, a destination is declared "dead" after a
configurable number of pseudo-cohorts reports connection or handshake failure.

Summary of the Postfix 2.5 concurrency feedback algorithm

We want to increment a destination's delivery concurrency when some (not
necessarily consecutive) number of deliveries complete without connection or
handshake failure. This is implemented with positive feedback g(N) where N is
the destination's delivery concurrency. With g(N)=1 feedback per delivery,
concurrency increases by 1 after each positive feedback event; this gives us
the old scheduler's exponential growth in time. With g(N)=1/N feedback per
delivery, concurrency increases by 1 after an entire pseudo-cohort N of
positive feedback reports; this gives us linear growth in time. Less-than-1
feedback per delivery and integer truncation naturally give us hysteresis, so
that transitions to larger concurrency happen every 1/g(N) positive feedback
events.

We want to decrement a destination's delivery concurrency when some (not
necessarily consecutive) number of deliveries complete after connection or
handshake failure. This is implemented with negative feedback f(N) where N is
the destination's delivery concurrency. With f(N)=1 feedback per delivery,
concurrency decreases by 1 after each negative feedback event; this gives us
the old scheduler's behavior where concurrency is throttled down dramatically
after a single pseudo-cohort failure. With f(N)=1/N feedback per delivery,
concurrency backs off more gently. Again, less-than-1 feedback per delivery
and integer truncation naturally give us hysteresis, so that transitions to
lower concurrency happen every 1/f(N) negative feedback events.

However, with negative feedback we introduce a subtle twist. We "reverse" the
negative hysteresis cycle so that the transition to lower concurrency happens
at the beginning of a sequence of 1/f(N) negative feedback events. Otherwise,
a correction for overload would be made too late. This makes the choice of
f(N) relatively unimportant, as borne out by measurements later in this
document.

In summary, the main ingredients for the Postfix 2.5 concurrency feedback
algorithm are a) the option of less-than-1 positive feedback per delivery to
avoid overwhelming servers, b) the option of less-than-1 negative feedback per
delivery to avoid giving up too fast, c) feedback hysteresis to avoid rapid
oscillation, and d) a "reverse" hysteresis cycle for negative feedback, so
that it can correct for overload quickly.

Summary of the Postfix 2.5 "dead destination" detection algorithm

We want to suspend deliveries to a specific destination after some number of
deliveries suffers connection or handshake failure. The old scheduler declares
a destination "dead" when negative (-1) feedback throttles the delivery
concurrency down to zero. With less-than-1 feedback per delivery, this
throttling down would obviously take too long. We therefore have to separate
"dead destination" detection from concurrency feedback. This is implemented by
introducing the concept of pseudo-cohort failure. The Postfix 2.5 concurrency
scheduler declares a destination "dead" after a configurable number of pseudo-
cohorts suffers from connection or handshake failures. The old scheduler
corresponds to the special case where the pseudo-cohort failure limit is equal
to 1.

Pseudocode for the Postfix 2.5 concurrency scheduler

The pseudo code shows how the ideas behind the new concurrency scheduler are
implemented as of November 2007. The actual code can be found in the module
qmgr/qmgr_queue.c.

    Types:
        Each destination has one set of the following variables
        int concurrency
        double success
        double failure
        double fail_cohorts

    Feedback functions:
        N is concurrency; x, y are arbitrary numbers in [0..1] inclusive
        positive feedback: g(N) = x/N | x/sqrt(N) | x
        negative feedback: f(N) = y/N | y/sqrt(N) | y

    Initialization:
        concurrency = initial_concurrency
        success = 0
        failure = 0
        fail_cohorts = 0

    After success:
        fail_cohorts = 0
        Be prepared for feedback > hysteresis, or rounding error
        success += g(concurrency)
        while (success >= 1)                    Hysteresis 1
            concurrency += 1                    Hysteresis 1
            failure = 0
            success -= 1                        Hysteresis 1
        Be prepared for overshoot
        if (concurrency > concurrency limit)
            concurrency = concurrency limit

    Safety:
        Don't apply positive feedback unless
            concurrency < busy_refcount + init_dest_concurrency
        otherwise negative feedback effect could be delayed

    After failure:
        if (concurrency > 0)
            fail_cohorts += 1.0 / concurrency
            if (fail_cohorts > cohort_failure_limit)
                concurrency = 0
        if (concurrency > 0)
            Be prepared for feedback > hysteresis, rounding errors
            failure -= f(concurrency)
            while (failure < 0)
                concurrency -= 1                Hysteresis 1
                failure += 1                    Hysteresis 1
                success = 0
            Be prepared for overshoot
            if (concurrency < 1)
                concurrency = 1
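
For experimentation, the pseudocode above translates almost line for line into
the following Python sketch. This is illustrative only: the class and method
names are invented here, the Safety check involving busy_refcount is omitted,
and the real implementation is the C code in qmgr/qmgr_queue.c.

    # Illustrative Python rendering of the pseudocode above; not the Postfix sources.
    class Destination:
        def __init__(self, initial_concurrency, concurrency_limit,
                     positive_feedback, negative_feedback, cohort_failure_limit):
            self.concurrency = initial_concurrency
            self.limit = concurrency_limit
            self.g = positive_feedback            # g(N), e.g. lambda n: 1.0 / n
            self.f = negative_feedback            # f(N), e.g. lambda n: 1.0 / n
            self.cohort_failure_limit = cohort_failure_limit
            self.success = 0.0
            self.failure = 0.0
            self.fail_cohorts = 0.0

        def delivery_ok(self):
            """Positive feedback: delivery without connection/handshake failure."""
            self.fail_cohorts = 0.0
            self.success += self.g(self.concurrency)
            while self.success >= 1:              # hysteresis 1
                self.concurrency += 1
                self.failure = 0.0
                self.success -= 1
            if self.concurrency > self.limit:     # be prepared for overshoot
                self.concurrency = self.limit

        def delivery_failed(self):
            """Negative feedback: delivery failed in connection or handshake."""
            if self.concurrency > 0:
                self.fail_cohorts += 1.0 / self.concurrency
                if self.fail_cohorts > self.cohort_failure_limit:
                    self.concurrency = 0          # destination declared "dead"
            if self.concurrency > 0:
                self.failure -= self.f(self.concurrency)
                while self.failure < 0:           # "reverse" hysteresis cycle
                    self.concurrency -= 1
                    self.failure += 1
                    self.success = 0.0
                if self.concurrency < 1:          # be prepared for overshoot
                    self.concurrency = 1

    # With 1/N feedback, concurrency grows roughly one step per pseudo-cohort:
    d = Destination(5, 20, lambda n: 1.0 / n, lambda n: 1.0 / n, 1.0)
    for _ in range(100):
        d.delivery_ok()
    print(d.concurrency)        # grew from 5, roughly linearly, still below 20
    d.delivery_failed()
    print(d.concurrency)        # one failure steps it down by 1 immediately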

Results for delivery to concurrency-limited servers

Discussions about the concurrency scheduler redesign started early 2004, when
the primary goal was to find alternatives that did not exhibit exponential
growth or rapid concurrency throttling. No code was implemented until late
2007, when the primary concern had shifted towards better handling of server
concurrency limits. For this reason we measure how well the new scheduler does
this job. The table below compares mail delivery performance of the old +/-1
feedback per delivery with several less-than-1 feedback functions, for
different limited-concurrency server scenarios. Measurements were done with a
FreeBSD 6.2 client and with FreeBSD 6.2 and various Linux servers.

Server configuration:

  * The mail flow was slowed down with 1 second latency per recipient
    ("smtpd_client_restrictions = sleep 1"). The purpose was to make results
    less dependent on hardware details, by avoiding slow-downs by queue file
    I/O, logging I/O, and network I/O.
  * Concurrency was limited by the server process limit
    ("default_process_limit = 5" and "smtpd_client_event_limit_exceptions =
    static:all"). Postfix was stopped and started after changing the process
    limit, because the same number is also used as the backlog argument to the
    listen(2) system call, and "postfix reload" does not re-issue this call.
  * Mail was discarded with "local_recipient_maps = static:all" and
    "local_transport = discard". The discard action in access maps or header/
    body checks could not be used as it fails to update the in_flow_delay
    counters.

Client configuration:

  * Queue file overhead was minimized by sending one message to a virtual
    alias that expanded into 2000 different remote recipients. All recipients
    were accounted for according to the maillog file. The
    virtual_alias_expansion_limit setting was increased to avoid complaints
    from the cleanup(8) server.
  * The number of deliveries was maximized with
    "smtp_destination_recipient_limit = 2". A smaller limit would cause
    Postfix to schedule the concurrency per recipient instead of domain, which
    is not what we want.
  * Maximum concurrency was limited with "smtp_destination_concurrency_limit =
    20", and initial_destination_concurrency was set to the same value.
  * The positive and negative concurrency feedback hysteresis was 1.
    Concurrency was incremented by 1 at the END of 1/feedback steps of
    positive feedback, and was decremented by 1 at the START of 1/feedback
    steps of negative feedback.
  * The SMTP client used the default 30s SMTP connect timeout and 300s SMTP
    greeting timeout.
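
For reference, the settings quoted in the bullets above, collected into
main.cf form. This merely restates the measurement setup; it is not a
recommended configuration:

    # Receiving (server) instance:
    smtpd_client_restrictions = sleep 1
    default_process_limit = 5
    smtpd_client_event_limit_exceptions = static:all
    local_recipient_maps = static:all
    local_transport = discard

    # Sending (client) instance:
    smtp_destination_recipient_limit = 2
    smtp_destination_concurrency_limit = 20
    initial_destination_concurrency = 20
    # virtual_alias_expansion_limit was also raised; the text gives no value.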

Impact of the 30s SMTP connect timeout

The first results are for a FreeBSD 6.2 server, where our artificially low
listen(2) backlog results in a very short kernel queue for established
connections. The table shows that all deferred deliveries failed due to a 30s
connection timeout, and none failed due to a server greeting timeout. This
measurement simulates what happens when the server's connection queue is
completely full under load, and the TCP engine drops new connections.

    client server feedback   connection percentage client concurrency timed-out in
    limit  limit  style      caching    deferred   average/stddev     connect/greeting
    -------------------------------------------------------------------------------
    20     5      1/N        no          9.9       19.4  0.49         198  -
    20     5      1/N        yes        10.3       19.4  0.49         206  -
    20     5      1/sqrt(N)  no         10.4       19.6  0.59         208  -
    20     5      1/sqrt(N)  yes        10.6       19.6  0.61         212  -
    20     5      1          no         10.1       19.5  1.29         202  -
    20     5      1          yes        10.8       19.3  1.57         216  -
    -------------------------------------------------------------------------------

    A busy server with a completely full connection queue. N is the client
    delivery concurrency. Failed deliveries time out after 30s without
    completing the TCP handshake. See text for a discussion of results.

Impact of the 300s SMTP greeting timeout

The next table shows results for a Fedora Core 8 server (results for RedHat
7.3 are identical). In this case, the artificially small listen(2) backlog
argument does not impact our measurement. The table shows that practically all
deferred deliveries fail after the 300s SMTP greeting timeout. As these
timeouts were 10x longer than with the first measurement, we increased the
recipient count (and thus the running time) by a factor of 10 to keep the
results comparable. The deferred mail percentages are a factor 10 lower than
with the first measurement, because the 1s per-recipient delay was 1/300th of
the greeting timeout instead of 1/30th of the connection timeout.

    client server feedback   connection percentage client concurrency timed-out in
    limit  limit  style      caching    deferred   average/stddev     connect/greeting
    -------------------------------------------------------------------------------
    20     5      1/N        no         1.16       19.8  0.37         -  230
    20     5      1/N        yes        1.36       19.8  0.36         -  272
    20     5      1/sqrt(N)  no         1.21       19.9  0.23         4  238
    20     5      1/sqrt(N)  yes        1.36       20.0  0.23         -  272
    20     5      1          no         1.18       20.0  0.16         -  236
    20     5      1          yes        1.39       20.0  0.16         -  278
    -------------------------------------------------------------------------------

    A busy server with a non-full connection queue. N is the client delivery
    concurrency. Failed deliveries complete at the TCP level, but time out
    after 300s while waiting for the SMTP greeting. See text for a discussion
    of results.

Impact of active server concurrency limiter

The final concurrency-limited result shows what happens when SMTP connections
don't time out, but are rejected immediately with the Postfix server's
smtpd_client_connection_count_limit feature (the server replies with a 421
status and disconnects immediately). Similar results can be expected with
concurrency limiting features built into other MTAs or firewalls. For this
measurement we specified a server concurrency limit and a client initial
destination concurrency of 5, and a server process limit of 10; all other
conditions were the same as with the first measurement. The same result would
be obtained with a FreeBSD or Linux server, because the "pushing back" is done
entirely by the receiving side.

    client server feedback   connection percentage client concurrency theoretical
    limit  limit  style      caching    deferred   average/stddev     defer rate
    -------------------------------------------------------------------------------
    20     5      1/N        no         16.5       5.17  0.38         1/6
    20     5      1/N        yes        16.5       5.17  0.38         1/6
    20     5      1/sqrt(N)  no         24.5       5.28  0.45         1/4
    20     5      1/sqrt(N)  yes        24.3       5.28  0.46         1/4
    20     5      1          no         49.7       5.63  0.67         1/2
    20     5      1          yes        49.7       5.68  0.70         1/2
    -------------------------------------------------------------------------------

    A server with active per-client concurrency limiter that replies with 421
    and disconnects. N is the client delivery concurrency. The theoretical
    defer rate is 1/(1+roundup(1/feedback)). This is always 1/2 with the fixed
    +/-1 feedback per delivery; with the concurrency-dependent feedback
    variants, the defer rate decreases with increasing concurrency. See text
    for a discussion of results.
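
The "theoretical defer rate" column follows directly from the formula in the
caption; a few lines of Python (illustrative only) reproduce it for the server
limit of 5 used in this measurement:

    import math

    def theoretical_defer_rate(feedback):       # feedback value, e.g. g(N) at N = 5
        return 1.0 / (1 + math.ceil(1.0 / feedback))

    n = 5                                        # server concurrency limit in this test
    print(theoretical_defer_rate(1.0 / n))             # 1/6 for 1/N feedback
    print(theoretical_defer_rate(1.0 / math.sqrt(n)))  # 1/4 for 1/sqrt(N) feedback
    print(theoretical_defer_rate(1.0))                 # 1/2 for fixed +/-1 feedback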

Discussion of concurrency-limited server results

All results in the previous sections are based on the first delivery runs
only; they do not include any second etc. delivery attempts. It's also worth
noting that the measurements look at steady-state behavior only. They don't
show what happens when the client starts sending at a much higher or lower
concurrency.

The first two examples show that the effect of feedback is negligible when
concurrency is limited due to congestion. This is because the initial
concurrency is already at the client's concurrency maximum, and because there
is 10-100 times more positive than negative feedback. Under these conditions,
it is no surprise that the contribution from SMTP connection caching is also
negligible.

In the last example, the old +/-1 feedback per delivery will defer 50% of the
mail when confronted with an active (anvil-style) server concurrency limit,
where the server hangs up immediately with a 421 status (a TCP-level RST would
have the same result). Less aggressive feedback mechanisms fare better than
more aggressive ones. Concurrency-dependent feedback fares even better at
higher concurrencies than shown here, but has limitations as discussed in the
next section.

Limitations of less-than-1 per delivery feedback

Less-than-1 feedback is of interest primarily when sending large amounts of
mail to destinations with active concurrency limiters (servers that reply with
421, or firewalls that send RST). When sending small amounts of mail per
destination, less-than-1 per-delivery feedback won't have a noticeable effect
on the per-destination concurrency, because the number of deliveries to the
same destination is too small. You might just as well use zero per-delivery
feedback and stay with the initial per-destination concurrency. And when mail
deliveries fail due to congestion instead of active concurrency limiters, the
measurements above show that per-delivery feedback has no effect. With large
amounts of mail you might just as well use zero per-delivery feedback and
start with the maximal per-destination concurrency.

The scheduler with less-than-1 concurrency feedback per delivery solves a
problem with servers that have active concurrency limiters. This works only
because feedback is handled in a peculiar manner: positive feedback will
increment the concurrency by 1 at the end of a sequence of events of length
1/feedback, while negative feedback will decrement concurrency by 1 at the
beginning of such a sequence. This is how Postfix adjusts quickly for
overshoot without causing lots of mail to be deferred. Without this difference
in feedback treatment, less-than-1 feedback per delivery would defer 50% of
the mail, and would be no better in this respect than the old +/-1 feedback
per delivery.

Unfortunately, the same feature that corrects quickly for concurrency
overshoot also makes the scheduler more sensitive to noisy negative feedback.
The reason is that one lonely negative feedback event has the same effect as a
complete sequence of length 1/feedback: in both cases delivery concurrency is
dropped by 1 immediately. As a worst-case scenario, consider multiple servers
behind a load balancer on a single IP address, and no backup MX address. When
1 out of K servers fails to complete the SMTP handshake or drops the
connection, a scheduler with 1/N (N = concurrency) feedback stops increasing
its concurrency once it reaches a concurrency level of about K, even though
the good servers behind the load balancer are perfectly capable of handling
more traffic.

This noise problem gets worse as the amount of positive feedback per delivery
gets smaller. A compromise is to use fixed less-than-1 positive feedback
values instead of concurrency-dependent positive feedback. For example, to
tolerate 1 of 4 bad servers in the above load balancer scenario, use positive
feedback of 1/4 per "good" delivery (no connect or handshake error), and use
an equal or smaller amount of negative feedback per "bad" delivery. The
downside of using concurrency-independent feedback is that some of the old
+/-1 feedback problems will return at large concurrencies. Sites that must
deliver mail at non-trivial per-destination concurrencies will require special
configuration.

Concurrency configuration parameters

The Postfix 2.5 concurrency scheduler is controlled with the following
configuration parameters, where "transport_foo" provides a transport-specific
parameter override. All parameter default settings are compatible with earlier
Postfix versions.

    Parameter name                                          Postfix  Description
                                                            version
    -----------------------------------------------------------------------------
    initial_destination_concurrency                         all      Initial per-destination
    transport_initial_destination_concurrency               2.5      delivery concurrency

    default_destination_concurrency_limit                   all      Maximum per-destination
    transport_destination_concurrency_limit                 all      delivery concurrency

    default_destination_concurrency_positive_feedback       2.5      Per-destination positive
    transport_destination_concurrency_positive_feedback     2.5      feedback amount, per delivery
                                                                     that does not fail with
                                                                     connection or handshake
                                                                     failure

    default_destination_concurrency_negative_feedback       2.5      Per-destination negative
    transport_destination_concurrency_negative_feedback     2.5      feedback amount, per delivery
                                                                     that fails with connection
                                                                     or handshake failure

    default_destination_concurrency_failed_cohort_limit     2.5      Number of failed pseudo-
    transport_destination_concurrency_failed_cohort_limit   2.5      cohorts after which a
                                                                     destination is declared
                                                                     "dead" and delivery is
                                                                     suspended

    destination_concurrency_feedback_debug                  2.5      Enable verbose logging of
                                                                     concurrency scheduler
                                                                     activity
    -----------------------------------------------------------------------------
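
As a small illustration of the "transport_" naming convention, a hypothetical
main.cf fragment follows. The numbers are arbitrary examples, not
recommendations, and as noted in the Disclaimer above, the transport-specific
forms do not appear in "postconf" output before Postfix 2.9:

    # Defaults for all transports:
    default_destination_concurrency_limit = 20
    default_destination_concurrency_failed_cohort_limit = 1

    # Overrides for the master.cf service named "relay" only:
    relay_initial_destination_concurrency = 2
    relay_destination_concurrency_limit = 10
    # Tolerate many failed pseudo-cohorts before declaring a destination dead:
    relay_destination_concurrency_failed_cohort_limit = 100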

Preemptive scheduling

The following sections describe the new queue manager and its preemptive
scheduler algorithm. Note that the document was originally written to describe
the changes between the new queue manager (in this text referred to as nqmgr,
the name it was known by before it became the default queue manager) and the
old queue manager (referred to as oqmgr). This is why it refers to oqmgr every
so often.

This document is divided into sections as follows:

  * The structures used by nqmgr
  * What happens when nqmgr picks up the message - how it is assigned to
    transports, jobs, peers, entries
  * How the entry selection works
  * How the preemption works - what messages may be preempted and how and what
    messages are chosen to preempt them
  * How destination concurrency limits affect the scheduling algorithm
  * Dealing with memory resource limits

The structures used by nqmgr

Let's start by recapitulating the structures and terms used when referring to
the queue manager and how it operates. Many of these are partially described
elsewhere, but it is nice to have a coherent overview in one place:

  * Each message structure represents one mail message which Postfix is to
    deliver. The message recipients specify to what destinations the message
    is to be delivered and what transports are going to be used for the
    delivery.

  * Each recipient entry groups a batch of recipients of one message which are
    all going to be delivered to the same destination (and over the same
    transport).

  * Each transport structure groups everything that is going to be delivered
    by delivery agents dedicated to that transport. Each transport maintains a
    set of queues (describing the destinations it shall talk to) and jobs
    (referencing the messages it shall deliver).

  * Each transport queue (not to be confused with the on-disk active queue or
    incoming queue) groups everything that is going to be delivered to a given
    destination (a.k.a. nexthop) by its transport. Each queue belongs to one
    transport, so each destination may be referred to by several queues, one
    for each transport. Each queue maintains a list of all recipient entries
    (batches of message recipients) which shall be delivered to the given
    destination (the todo list), and a list of recipient entries already being
    delivered by the delivery agents (the busy list).

  * Each queue corresponds to multiple peer structures. Each peer structure is
    like the queue structure, belonging to one transport and referencing one
    destination. The difference is that it lists only the recipient entries
    which all originate from the same message, unlike the queue structure,
    whose entries may originate from various messages. For messages with few
    recipients, there is usually just one recipient entry for each
    destination, resulting in one recipient entry per peer. But for large
    mailing list messages the recipients may need to be split into multiple
    recipient entries, in which case the peer structure may list many entries
    for a single destination.

  * Each transport job groups everything it takes to deliver one message via
    its transport. Each job represents one message within the context of the
    transport. The job belongs to one transport and message, so each message
    may have multiple jobs, one for each transport. The job groups all the
    peer structures, which describe the destinations the job's message has to
    be delivered to.

The first four structures are common to both nqmgr and oqmgr, the latter two
were introduced by nqmgr.
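
As a rough structural sketch (Python-style and purely illustrative; the field
names are invented here and do not correspond to the actual C struct members
in the qmgr sources):

    from dataclasses import dataclass, field

    @dataclass
    class Message:           # one mail message to be delivered
        queue_id: str
        recipients: list

    @dataclass
    class Entry:             # one batch of one message's recipients, one destination
        message: Message
        recipients: list

    @dataclass
    class Queue:             # per transport and destination; entries of all messages
        nexthop: str
        todo: list = field(default_factory=list)     # entries waiting for delivery
        busy: list = field(default_factory=list)     # entries with delivery agents

    @dataclass
    class Peer:              # like Queue, but entries of a single message only
        queue: Queue
        entries: list = field(default_factory=list)

    @dataclass
    class Job:               # one message within the context of one transport
        message: Message
        peers: list = field(default_factory=list)

    @dataclass
    class Transport:         # everything delivered by one class of delivery agents
        name: str
        queues: dict = field(default_factory=dict)   # nexthop -> Queue
        jobs: list = field(default_factory=list)     # job list, in enqueue order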

These terms are used extensively in the text below; feel free to look up the
descriptions above any time you feel you have lost track of what is what.

What happens when nqmgr picks up the message

Whenever nqmgr moves a queue file into the active queue, the following
happens: It reads all necessary information from the queue file as oqmgr does,
and also reads as many recipients as possible - more on that later, for now
let's just pretend it always reads all recipients.

Then it resolves the recipients as oqmgr does, which means obtaining the
(address, nexthop, transport) triple for each recipient. For each triple, it
finds the transport; if it does not exist yet, it instantiates it (unless it's
dead). Within the transport, it finds the destination queue for the given
nexthop; if it does not exist yet, it instantiates it (unless it's dead). The
triple is then bound to the given destination queue. This happens in
qmgr_resolve() and is basically the same as in oqmgr.

Then for each triple which was bound to some queue (and thus transport), the
program finds the job which represents the message within that transport's
context; if it does not exist yet, it instantiates it. Within the job, it
finds the peer which represents the bound destination queue within this job's
context; if it does not exist yet, it instantiates it. Finally, it stores the
address from the resolved triple to the recipient entry which is appended to
both the queue entry list and the peer entry list. The addresses for the same
nexthop are batched in the entries up to the recipient_concurrency limit for
that transport. This happens in qmgr_assign() and, apart from the fact that it
operates with job and peer structures, it is basically the same as in oqmgr.

When the job is instantiated, it is enqueued on the transport's job list based
on the time its message was picked up by nqmgr. For the first batch of
recipients this means it is appended to the end of the job list, but the
ordering of the job list by the enqueue time is important as we will see
shortly.

[Now you should have a pretty good idea of the state of nqmgr after a couple
of messages have been picked up, and of the relation between all those job,
peer, queue and entry structures.]

How the entry selection works

Having prepared all those above mentioned structures, the task of the nqmgr's
scheduler is to choose the recipient entries one at a time and pass them to
the delivery agent for the corresponding transport. Now how does this work?

The first approximation of the new scheduling algorithm is like this:

    foreach transport (round-robin-by-transport)
    do
        if transport busy continue
        if transport process limit reached continue
        foreach transport's job (in the order of the transport's job list)
        do
            foreach job's peer (round-robin-by-destination)
                if peer->queue->concurrency < peer->queue->window
                    return next peer entry.
            done
        done
    done

Now what is the "order of the transport's job list"? As we know already, the
job list is by default kept in the order the message was picked up by the
nqmgr. So by default we get top-level round-robin over transports, and within
each transport we get FIFO message delivery. The round-robin of the peers by
the destination is perhaps of little importance in most real-life cases
(unless the recipient_concurrency limit is reached, there is only one peer
structure for each destination in one job), but theoretically it makes sure
that even within single jobs, destinations are treated fairly.
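
A rough Python rendering of this first approximation (illustrative only; plain
iteration stands in for the round-robin, which in reality would rotate the
starting transport and peer between calls, and the data layout is just a
stand-in, not the qmgr structures):

    # Illustrative sketch of the entry selection loop above; not the qmgr code.
    # Each transport is a dict; "jobs" is the scheduler's job list in enqueue
    # order, each job holds peers, and each peer references a destination queue
    # with "concurrency" (busy deliveries) and "window" (current limit).
    def select_entry(transports):
        for transport in transports:        # top-level round-robin by transport
            if transport["busy"] or transport["processes"] >= transport["process_limit"]:
                continue
            for job in transport["jobs"]:   # FIFO by message enqueue time
                for peer in job["peers"]:   # round-robin by destination within the job
                    queue = peer["queue"]
                    if queue["concurrency"] < queue["window"] and peer["entries"]:
                        return peer["entries"].pop(0)
        return None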

[By now you should have a feeling you really know how the scheduler works,
except for the preemption, under ideal conditions - that is, no recipient
resource limits and no destination concurrency problems.]

How the preemption works

As you might perhaps expect by now, the transport's job list does not remain
sorted by the job's message enqueue time all the time. The coolest thing about
nqmgr is not the simple FIFO delivery, but that it is able to slip mail with
few recipients past mailing-list bulk mail. This is what the job preemption is
about - shuffling the jobs on the transport's job list to get the best message
delivery rates. Now how is it achieved?

First I have to tell you that there are in fact two job lists in each
transport. One is the scheduler's job list, which the scheduler is free to
play with, while the other one keeps the jobs always listed in the order of
the enqueue time and is used for recipient pool management, which we will
discuss later. For now, we will deal with the scheduler's job list only.

So, we have the job list, which is first ordered by the time the jobs'
messages were enqueued, oldest messages first, the most recently picked one at
the end. For now, let's assume that there are no destination concurrency
problems. Without preemption, we pick some entry of the first (oldest) job on
the queue, assign it to a delivery agent, pick another one from the same job,
assign it again, and so on, until all the entries are used and the job is
delivered. We would then move onto the next job and so on and on. Now how do
we manage to sneak in some entries from the recently added jobs when the first
job on the job list belongs to a message going to the mailing-list and has
thousands of recipient entries?

The nqmgr's answer is that we can artificially "inflate" the delivery time of
that first job by some constant for free - it is basically the same trick you
might remember as "accumulation of potential" from the amortized complexity
lessons. For example, instead of delivering the entries of the first job on
the job list every time a delivery agent becomes available, we can do it only
every second time. If you view the moments the delivery agent becomes
available on a timeline as "delivery slots", then instead of using every
delivery slot for the first job, we can use only every other slot, and still
the overall delivery efficiency of the first job remains the same. So the
delivery 11112222 becomes 1.1.1.1.2.2.2.2 (1 and 2 are the imaginary job
numbers, . denotes the free slot). Now what do we do with free slots?

As you might have guessed, we will use them for sneaking the mail with few
recipients in. For example, if we have one four-recipient mail followed by
four one-recipient mails, the delivery sequence (that is, the sequence in
which the jobs are assigned to the delivery slots) might look like this:
12131415. Hmm, fine for sneaking in the single-recipient mail, but how do we
sneak in the mail with more than one recipient? Say we have one four-recipient
mail followed by two two-recipient mails?

The simple answer would be to use delivery sequence 12121313. But the problem
is that this does not scale well. Imagine you have a mail with a thousand
recipients followed by a mail with a hundred recipients. It is tempting to
suggest a delivery sequence like 121212...., but alas! Imagine another mail
arrives, with say ten recipients. But there are no free slots anymore, so it
can't slip by, not even if it had just one recipient. It will be stuck until
the hundred-recipient mail is delivered, which really sucks.

So, it becomes obvious that while inflating the message to get free slots is a
great idea, one has to be really careful how the free slots are assigned,
otherwise one might paint oneself into a corner. So, how does nqmgr really use
the free slots?

The key idea is that one does not have to generate the free slots in a uniform
way. The delivery sequence 111...1 is no worse than 1.1.1.1; in fact, it is
even better, as in the first case some entries are selected earlier than in
the second case, and none is selected later! So it is possible to first
"accumulate" the free delivery slots and then use them all at once. It is even
possible to accumulate some, then use them, then accumulate some more and use
them again, as in 11..1.1 .

Let's get back to the example with the thousand-recipient mail followed by the
hundred-recipient mail. We now know that we could first accumulate one hundred
free slots, and only then preempt the first job and sneak the one hundred
recipient mail in. Applying the algorithm recursively, we see the hundred
recipient job can accumulate ten free delivery slots, and then we could
preempt it and sneak in the ten-recipient mail... Wait wait wait! Could we?
Aren't we overinflating the original one thousand recipient mail?

Well, despite how it looks at first glance, another trick allows us to answer
"no, we are not!". If we had said that we will inflate the delivery time at
most twice, and then consider every other slot a free slot, then we would
overinflate in the case of recursive preemption. BUT! The trick is that if we
use only every n-th slot as a free slot for n>2, there is always some worst
inflation factor which we can guarantee not to be breached, even if we apply
the algorithm recursively. To be precise, if for every k>1 normally used slots
we accumulate one free delivery slot, then the inflation factor is not worse
than k/(k-1) no matter how many recursive preemptions happen. And it's not
worse than (k+1)/k if only non-recursive preemption happens. Now, having got
through the theory and the related math, let's see how nqmgr implements this.
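
For example, with the default slot cost of 5 discussed below (k=5), the
guaranteed worst-case inflation is 5/4 = 1.25 when preemptions nest
recursively, and (5+1)/5 = 1.2 when only non-recursive preemption happens.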

Each job has a so-called "available delivery slot" counter. Each transport has
a transport_delivery_slot_cost parameter, which defaults to the
default_delivery_slot_cost parameter, which is set to 5 by default. This is
the k from the paragraph above. Each time k entries of the job are selected
for delivery, this counter is incremented by one. Once some slots have
accumulated, a job which requires no more than that number of slots to be
fully delivered can preempt this job.

[Well, the truth is, the counter is incremented every time an entry is
selected and it is divided by k when it is used. But for understanding it's
good enough to use the above approximation of the truth.]
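
The bookkeeping can be illustrated with a toy Python simulation (purely
illustrative; it ignores loans, discounts, blockers and destination limits,
always picks the oldest fitting job as the candidate, and preempts only when
the candidate fits entirely within the slots the current job has accumulated):

    # Toy model of delivery slot accumulation and preemption; not the qmgr code.
    def simulate(jobs, slot_cost):
        """jobs: list of (name, recipient_entries) in enqueue order."""
        todo = [{"name": n, "left": e, "slots": 0.0} for n, e in jobs]
        sequence = ""
        while todo:
            current = todo[0]
            # Preempt if some later, smaller job fits in the accumulated slots.
            fit = next((j for j in todo[1:] if j["left"] <= current["slots"]), None)
            if fit:
                current["slots"] -= fit["left"]    # pay for the candidate's slots
                todo.remove(fit)
                todo.insert(0, fit)                # move it in front of the current job
                continue
            sequence += current["name"]            # hand one entry to a delivery agent
            current["left"] -= 1
            current["slots"] += 1.0 / slot_cost    # 1 free slot per slot_cost entries
            if current["left"] == 0:
                todo.pop(0)
        return sequence

    print(simulate([("1", 10), ("2", 2), ("3", 2)], slot_cost=2))
    # -> 11112211113311 (compare with the ten-recipient example later in the text)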

OK, so now we know the conditions which must be satisfied so one job can
preempt another one. But what job gets preempted, how do we choose what job
preempts it if there are several valid candidates, and when exactly does all
this happen?

The answer to the first part is simple. The job whose entry was selected the
last time is the so-called current job. Normally, it is the first job on the
scheduler's job list, but destination concurrency limits may change this as we
will see later. It is always only the current job which may get preempted.

Now for the second part. The current job has a certain number of recipient
entries, and as such may accumulate at most some number of available delivery
slots. It might have already accumulated some, and perhaps even already used
some when it was preempted before (remember a job can be preempted several
times). In either case, we know how many are accumulated and how many are left
to deliver, so we know how many it may yet accumulate at most. Every other job
which may be delivered in fewer than that number of slots is a valid candidate
for preemption. How do we choose among them?

The answer is - the one with the maximum enqueue_time/recipient_entry_count.
That is, the older the job is, the more we should try to deliver it in order
to get the best message delivery rates. These rates are of course subject to
how many recipients the message has, therefore the division by the recipient
(entry) count. No one shall be surprised that a message with n recipients
takes n times longer to deliver than a message with one recipient.

Now let's recap the previous two paragraphs. Isn't it too complicated? Why
don't the candidates come only from among the jobs which can be delivered
within the number of slots the current job has already accumulated? Why do we
need to estimate how much it has yet to accumulate? If you have figured out
the answer, congratulate yourself. If we did it this simple way, we would
always choose the candidate with the fewest recipient entries. If there were
enough single-recipient mails coming in, they would always slip past the bulk
mail as soon as possible, and mail with two or more recipients would never get
a chance, no matter how long it had been sitting around in the job list.

This candidate selection has an interesting implication - that when we choose
the best candidate for preemption (this is done in qmgr_choose_candidate()),
it may happen that we cannot use it for preemption immediately. This leads to
an answer to the last part of the original question - when does the preemption
happen?

The preemption attempt happens every time the transport's next recipient entry
is to be chosen for delivery. To avoid needless overhead, the preemption is
not attempted if the current job could never accumulate more than
transport_minimum_delivery_slots (defaults to default_minimum_delivery_slots,
which defaults to 3). If there are already enough accumulated slots to preempt
the current job by the chosen best candidate, it is done immediately. This
basically means that the candidate is moved in front of the current job on the
scheduler's job list and the accumulated slot counter is decreased by the
amount used by the candidate. If there are not enough slots... well, I could
say that nothing happens and another preemption is attempted the next time.
But that's not the complete truth.

The truth is that it turns out that it is not really necessary to wait until
the job's counter accumulates all the delivery slots in advance. Say we have a
ten-recipient mail followed by two two-recipient mails. If the preemption
happened only when enough delivery slots have accumulated (assuming slot cost
2), the delivery sequence becomes 11112211113311. Now what would we get if we
waited for only 50% of the necessary slots to accumulate and promised to wait
for the remaining 50% later, after we get back to the preempted job? If we use
such a slot loan, the delivery sequence becomes 11221111331111. As we can see,
it does not make the delivery of the ten-recipient mail considerably worse,
but it allows the small messages to be delivered sooner.

The concept of these slot loans is where the transport_delivery_slot_discount
and transport_delivery_slot_loan parameters come from (they default to
default_delivery_slot_discount and default_delivery_slot_loan, whose values
are by default 50 and 3, respectively). The discount (resp. loan) specifies
how many percent (resp. how many slots) one "gets in advance", when the number
of slots required to deliver the best candidate is compared with the number of
slots the current job has accumulated so far.
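
In terms of the toy simulation shown earlier, the discount roughly corresponds
to letting the preemption test pass once only "discount" percent of the
candidate's slots are covered while still charging the full amount, so the
current job's slot counter temporarily goes negative by the remainder. With a
50% discount and the same ten/two/two-recipient jobs, that model produces the
sequence 11221111331111 quoted above. This is a simplified reading of the real
discount and loan parameters, not a description of the actual code.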

And that pretty much concludes this chapter.

[Now you should have a feeling that you pretty much understand the scheduler
and the preemption, or at least that you will have it after you read the last
chapter a couple more times. You should clearly see the job list and the
preemption happening at its head, in ideal delivery conditions. The feeling of
understanding will last until you start wondering what happens if some of the
jobs are blocked, which you might eventually figure out correctly from what
has been said already. But I would be surprised if your mental image of the
scheduler's functionality is not completely shattered once you start wondering
how it works when not all recipients may be read in-core. More on that later.]

How destination concurrency limits affect the scheduling algorithm

The nqmgr uses the same algorithm for destination concurrency control as
oqmgr. Now what happens when the destination limits are reached and no more
entries for that destination may be selected by the scheduler?

From the user's point of view it is all simple. If some of the peers of a job
can't be selected, those peers are simply skipped by the entry selection
algorithm (the pseudo-code described before) and only the selectable ones are
used. If none of the peers may be selected, the job is declared a "blocker
job". Blocker jobs are skipped by the entry selection algorithm and they are
also excluded from the candidates for preemption of the current job. Thus the
scheduler effectively behaves as if the blocker jobs didn't exist on the job
list at all. As soon as at least one of the peers of a blocker job becomes
unblocked (that is, the delivery agent handling the delivery of the recipient
entry for the given destination successfully finishes), the job's blocker
status is removed and the job again participates in all further scheduler
actions normally.

So the summary is that the users don't really have to be concerned about the
interaction of the destination limits and the scheduling algorithm. It works
well on its own and there are no knobs they would need to tune.

From a programmer's point of view, the blocker jobs complicate the scheduler
quite a lot. Without them, the jobs on the job list would normally be
delivered in strict FIFO order. If the current job is preempted, the job
preempting it is completely delivered unless it is preempted itself. Without
blockers, the current job is thus always either the first job on the job list,
or the top of the stack of jobs preempting the first job on the job list.

The visualization of the job list and the preemption stack without blockers
would be like this:

      first job->  1--2--3--5--6--8--...  <- job list
      on job list  |
                   4  <- preemption stack
                   |
     current job-> 7

In the example above we see that job 1 was preempted by job 4 and then job 4
was preempted by job 7. After job 7 is completed, the remaining entries of job
4 are selected, and once they are all selected, job 1 continues.

As we see, it's all very clean and straightforward. Now how does this change
because of blockers?

The answer is: a lot. Any job may become a blocker job at any time, and may
also become a normal job again at any time. This has several important
implications:

 1. The jobs may be completed in arbitrary order. For example, in the example
    above, if the current job 7 becomes blocked, the next job 4 may complete
    before job 7 becomes unblocked again. Or if both 7 and 4 are blocked, then
    1 is completed, then 7 becomes unblocked and is completed, then 2 is
    completed and only after that 4 becomes unblocked and is completed... You
    get the idea.

    [Interesting side note: even when jobs are delivered out of order, from a
    single destination's point of view the jobs are still delivered in the
    expected order (that is, FIFO unless there was some preemption involved).
    This is because whenever a destination queue becomes unblocked (the
    destination limit allows selection of more recipient entries for that
    destination), all jobs which have peers for that destination are unblocked
    at once.]

 2. The idea of the preemption stack at the head of the job list is gone. That
    is, it must be possible to preempt any job on the job list. For example,
    if jobs 7, 4, 1 and 2 in the example above all become blocked, job 3
    becomes the current job. And of course we do not want the preemption to be
    affected by whether there are some blocked jobs or not. Therefore, if it
    turns out that job 3 might be preempted by job 6, the implementation shall
    make it possible.

 3. The idea of the linear preemption stack itself is gone. It's no longer
    true that one job is always preempted by only one job at one time (that is
    directly preempted, not counting the recursively nested jobs). For
    example, in the example above, job 1 is directly preempted by only job 4,
    and job 4 by job 7. Now assume job 7 becomes blocked, and job 4 is being
    delivered. If it accumulates enough delivery slots, it is natural that it
    might be preempted for example by job 8. Now job 4 is preempted by both
    job 7 AND job 8 at the same time.

Now combine points 2) and 3) with point 1) again and you realize that the
relations on the once linear job list become pretty complicated. If we extend
the point 3) example: jobs 7 and 8 preempt job 4, now job 8 becomes blocked
too, then job 4 completes. Tricky, huh?

If I illustrate the relations after the above-mentioned examples (except those
in point 1)), the situation looks like this:

                                      v- parent

          adoptive parent ->    1--2--3--5--...        <- "stack" level 0
                                |     |
             parent gone ->     ?     6                <- "stack" level 1
                               / \
                children ->   7   8   ^- child         <- "stack" level 2

                              ^- siblings

Now how does nqmgr deal with all these complicated relations?

Well, it maintains them all as described, but fortunately, all these relations
are necessary only for the purpose of proper counting of available delivery
slots. For the purpose of ordering the jobs for entry selection, the original
rule still applies: "the job preempting the current job is moved in front of
the current job on the job list". So for entry selection purposes, the job
relations remain as simple as this:

    7--8--1--2--6--3--5--..  <- scheduler's job list order

The job list order and the preemption parent/child/siblings relations are
maintained separately. And because the selection works only with the job list,
you can happily forget about those complicated relations unless you want to
study the nqmgr sources. In that case the text above might provide some
helpful introduction to the problem domain. Otherwise I suggest you just
forget about all this and stick with the user's point of view: the blocker
jobs are simply ignored.

[By now, you should have a feeling that there are more things going on under
the hood than you ever wanted to know. You decide that forgetting about this
chapter is the best you can do for the sake of your mental health and you
basically stick with the idea of how the scheduler works in ideal conditions,
when there are no blockers, which is good enough.]
|
|
|
|
+DDeeaalliinngg wwiitthh mmeemmoorryy rreessoouurrccee lliimmiittss
|
|
|
|
+
|
|
|
|
+When discussing the nqmgr scheduler, we have so far assumed that all recipients
|
|
|
|
+of all messages in the active queue are completely read into the memory. This
|
|
|
|
+is simply not true. There is an upper bound on the amount of memory the nqmgr
|
|
|
|
+may use, and therefore it must impose some limits on the information it may
|
|
|
|
+store in the memory at any given time.
|
|
|
|
+
|
|
|
|
+First of all, not all messages may be read in-core at once. At any time, only
|
|
|
|
+qmgr_message_active_limit messages may be read in-core at maximum. When read
|
|
|
|
+into memory, the messages are picked from the incoming and deferred message
|
|
|
|
+queues and moved to the active queue (incoming having priority), so if there is
|
|
|
|
+more than qmgr_message_active_limit messages queued in the active queue, the
|
|
|
|
+rest will have to wait until (some of) the messages in the active queue are
|
|
|
|
+completely delivered (or deferred).
|
|
|
|
+
|
|
|
|
Even with the limited amount of in-core messages, there is another limit
which must be imposed in order to avoid memory exhaustion. Each message may
contain a huge number of recipients (tens or hundreds of thousands are not
uncommon), so if nqmgr were to read all recipients of all messages in the
active queue, it could easily run out of memory. Therefore there must be some
upper bound on the number of message recipients which are read into memory at
the same time.

Before discussing how exactly nqmgr implements the recipient limits, let's
see how the mere existence of these limits affects nqmgr and its scheduler.

The message limit is straightforward - it just limits the size of the
lookahead the nqmgr's scheduler has when choosing which message can preempt
the current one. Messages not in the active queue are simply not considered
at all.

The recipient limit complicates more things. First of all, the message
reading code must support reading the recipients in batches, which among
other things means accessing the queue file several times and continuing
where the last recipient batch ended. This is invoked by the scheduler
whenever the current job has space for more recipients, subject to the
transport's refill_limit and refill_delay parameters. It is also done
whenever all in-core recipients of the message have been dealt with (which
may also mean they were deferred) but there are still more in the queue file.

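The refill_limit and refill_delay names above are generic shorthands; in
main.cf they normally appear as the global defaults
default_recipient_refill_limit and default_recipient_refill_delay, or as
transport-specific overrides built from the master.cf service name. A hedged
example (the smtp_ names merely follow that naming pattern, and the values
are only illustrative):

    default_recipient_refill_limit = 100
    default_recipient_refill_delay = 5s
    # transport-specific override for the "smtp" master.cf service
    smtp_recipient_refill_limit = 200
    smtp_recipient_refill_delay = 10s
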
The second complication is that with some recipients left unread in the
queue file, the scheduler can't operate with exact counts of recipient
entries. With unread recipients, it is not clear how many recipient entries
there will be, as they are subject to per-destination grouping. It is not
even clear to what transports (and thus jobs) the recipients will be
assigned. And with messages coming from the deferred queue, it is not even
clear how many unread recipients are still to be delivered. This all means
that the scheduler must work with estimates of how many recipient entries
there will be. Fortunately, it is possible to estimate the minimum and
maximum correctly, so the scheduler can always err on the safe side.
Obviously, the better the estimates, the better the results, so it is best
when we are able to read all recipients in-core and turn the estimates into
exact counts, or at least try to read as many as possible to make the
estimates as accurate as possible.

The third complication is that it is no longer true that the scheduler is
done with a job once all of its in-core recipients are delivered. It is
possible that the job will be revived later, when another batch of recipients
is read in core. It is also possible that some jobs will be created for the
first time long after the first batch of recipients was read in core. The
nqmgr code must be ready to handle all such situations.

And finally, the fourth complication is that the nqmgr code must somehow
impose the recipient limit itself. Now how does it achieve that?

Perhaps the easiest solution would be to say that each message may have at
most X recipients stored in-core, but such a solution would be poor for
several reasons. With reasonable qmgr_message_active_limit values, the X
would have to be quite low to maintain a reasonable memory footprint. And
with a low X, lots of things would not work well. The nqmgr would have
problems using the transport_destination_recipient_limit efficiently. The
scheduler's preemption would be suboptimal as the recipient count estimates
would be inaccurate. The message queue file would have to be accessed many
times to read in more recipients again and again.

Therefore it seems reasonable to have a solution which does not use a limit
imposed on a per-message basis, but which maintains a pool of available
recipient slots that can be shared among all messages in the most efficient
manner. And as we do not want separate transports to compete for resources
whenever possible, it seems appropriate to maintain such a recipient pool for
each transport separately. This is the general idea; now how does it work in
practice?

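In data-structure terms, the general idea amounts to something like the
following Python sketch (the names are invented for illustration; nqmgr
itself keeps these counters in C structures). It also fixes the vocabulary
used by the later sketches in this chapter:

    class Transport:
        def __init__(self, recipient_limit, extra_recipient_limit):
            # Pool of recipient slots shared by all jobs of this transport.
            self.unused_slots = recipient_limit
            # Extra pool, used only when an unread job preempts the current
            # job (discussed near the end of this chapter).
            self.unused_extra_slots = extra_recipient_limit
            # Jobs of this transport, sorted by message enqueue time.
            self.jobs_by_enqueue_time = []
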
First we have to solve a little chicken-and-egg problem. If we want to use
the per-transport recipient pools, we first need to know to what transport(s)
the message is assigned. But we will find that out only after we read in the
recipients. So it is obvious that we first have to read in some recipients,
use them to find out to what transports the message is to be assigned, and
only after that can we use the per-transport recipient pools.

Now how many recipients shall we read for the first time? This is what the
qmgr_message_recipient_minimum and qmgr_message_recipient_limit values
control. The qmgr_message_recipient_minimum value specifies how many
recipients of each message we will read the first time, no matter what. It is
necessary to read at least one recipient before we can assign the message to
a transport and create the first job. However, reading only
qmgr_message_recipient_minimum recipients even when there are only a few
messages with a few recipients in-core would be wasteful. Therefore, if there
are fewer than qmgr_message_recipient_limit recipients in-core so far, the
first batch of recipients may be larger than qmgr_message_recipient_minimum -
as large as is required to reach the qmgr_message_recipient_limit limit.

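A minimal Python sketch of this first-batch sizing rule (illustrative only,
not the nqmgr source; total_in_core is an assumed bookkeeping value, and the
defaults shown are the stock main.cf defaults):

    # Size of the first recipient batch of a newly activated message.
    def first_batch_size(total_in_core,
                         qmgr_message_recipient_minimum=10,
                         qmgr_message_recipient_limit=20000):
        # Always read at least the minimum; while the global in-core
        # recipient count is still below the limit, read enough to reach it.
        return max(qmgr_message_recipient_minimum,
                   qmgr_message_recipient_limit - total_in_core)
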
Once the first batch of recipients has been read in core and the message
jobs have been created, the size of the subsequent recipient batches (if any
- of course it's best when all recipients are read in one batch) is based
solely on the position of the message jobs on their corresponding transports'
job lists. Each transport has a pool of transport_recipient_limit recipient
slots which it can distribute among its jobs (how this is done is described
later). The subsequent recipient batch may be as large as the sum of the
recipient slots of all jobs of the message permits (plus the
qmgr_message_recipient_minimum amount which always applies).

For example, if a message has three jobs, the first with 1 recipient still
in-core and 4 recipient slots, the second with 5 recipients in-core and 5
recipient slots, and the third with 2 recipients in-core and 0 recipient
slots, it has 1+5+2=8 recipients in-core and 4+5+0=9 recipient slots in
total. This means that we could immediately read
1+qmgr_message_recipient_minimum more recipients of that message in core.

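The same arithmetic, written out as a small Python sketch (illustrative
only; it assumes the stock default of 10 for
qmgr_message_recipient_minimum):

    # Each pair is (recipients in-core, recipient slots) for one job of the
    # three-job example message above.
    jobs = [(1, 4), (5, 5), (2, 0)]
    qmgr_message_recipient_minimum = 10

    in_core = sum(r for r, _ in jobs)      # 1 + 5 + 2 = 8
    slots = sum(s for _, s in jobs)        # 4 + 5 + 0 = 9
    # The message may read the unused slots plus the per-message minimum.
    next_batch = max(0, slots - in_core) + qmgr_message_recipient_minimum
    print(next_batch)                      # 1 + 10 = 11
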
The above example illustrates several things which might be worth mentioning
explicitly: first, note that although the per-transport slots are assigned to
particular jobs, we can't guarantee that once the next batch of recipients is
read in core, the corresponding numbers of recipients will be assigned to
those jobs. The jobs lend their slots to the message as a whole, so it is
possible that some jobs end up sponsoring other jobs of their message. For
example, if in the example above 2 of the newly read recipients were assigned
to the second job, the first job sponsored the second job with 2 slots. The
second notable thing is the third job, which has more recipients in-core than
it has slots. Apart from the sponsoring by another job that we just saw, this
can be a result of the first recipient batch, which is sponsored from the
global recipient pool of qmgr_message_recipient_limit recipients. It can also
be sponsored from the message recipient pool of
qmgr_message_recipient_minimum recipients.

Now how does each transport distribute the recipient slots among its jobs?
The strategy is quite simple. As most scheduler activity happens at the head
of the job list, it is our intention to make sure that the scheduler has the
best estimates of the recipient counts for those jobs. As we mentioned above,
this means that we want to try to make sure that the messages of those jobs
have all recipients read in-core. Therefore the transport distributes the
slots "along" the job list from start to end. In this case the job list
sorted by message enqueue time is used, because, unlike the scheduler's job
list, it doesn't change over time.

More specifically, each time a job is created and appended to the job list,
it gets all unused recipient slots from its transport's pool. It keeps them
until all recipients of its message are read. When this happens, all unused
recipient slots are transferred to the next job (which is in fact now the
first such job) on the job list which still has some recipients unread, or
eventually back to the transport pool if there is no such job. Such a
transfer also happens whenever a recipient entry of that job is delivered.

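To make the hand-off a bit more tangible, here is a small Python sketch of
it (illustrative only; the attribute names are invented stand-ins for
nqmgr's internal counters, building on the Transport sketch above):

    # Called when all recipients of the job's message have been read, and
    # also whenever one of the job's recipient entries has been delivered.
    def release_unused_slots(job, transport):
        unused = job.slots - job.recipients_in_core   # hypothetical fields
        if unused <= 0:
            return
        job.slots -= unused
        # Hand the slots to the first job, in message enqueue order, that
        # still has unread recipients in its queue file ...
        for other in transport.jobs_by_enqueue_time:
            if other.has_unread_recipients:
                other.slots += unused
                return
        # ... or return them to the transport pool if there is no such job.
        transport.unused_slots += unused
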
There is also a scenario where a job is not appended to the end of the job
list (for example, it was created as a result of a second or later recipient
batch). Then it works exactly as above, except that if it was put in front of
the first unread job (that is, the job of a message which still has some
unread recipients in the queue file), that job is first forced to return all
of its unused recipient slots to the transport pool.

The algorithm just described leads to the following state: the first unread
job on the job list always gets all the remaining recipient slots of that
transport (if there are any). The jobs queued before this job are completely
read (that is, all recipients of their message were already read in core) and
have at most as many slots as they still have recipients in-core (the maximum
is there because of the sponsoring mentioned before), and the jobs after this
job get nothing from the transport recipient pool (unless they got something
before and the first unread job was then created and enqueued in front of
them - in such a case they also get at most as many slots as they have
recipients in-core).

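To restate that invariant as a picture, here is an invented snapshot of one
transport's job list (the letters are made up for illustration only):

    job (enqueue order)   message fully read?     recipient slots held
    A                     yes                     at most its in-core recipients
    B                     no (first unread job)   all remaining pool slots
    C, D                  no                      none, for now
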
Things work fine in this state most of the time, because the current job is
either completely read in-core or has as many recipient slots as there are,
but there is one situation which we still have to take care of specially.
Imagine that the current job is preempted by some unread job from the job
list when there are no more recipient slots available, so this new current
job could read only batches of qmgr_message_recipient_minimum recipients at a
time. This would really degrade performance. For this reason, each transport
has an extra pool of transport_extra_recipient_limit recipient slots,
dedicated exactly to this situation. Each time an unread job preempts the
current job, it gets half of the remaining recipient slots from the normal
pool and this extra pool.

And that's it. It sure does sound pretty complicated, but fortunately most
people don't really have to care how exactly it works as long as it works.
Perhaps the only important things to know for most people are the following
upper bound formulas:

Each transport has at maximum

    max(
        qmgr_message_recipient_minimum * qmgr_message_active_limit
            + *_recipient_limit + *_extra_recipient_limit,
        qmgr_message_recipient_limit
    )

recipients in core.

The total number of recipients in core is

    max(
        qmgr_message_recipient_minimum * qmgr_message_active_limit
            + sum( *_recipient_limit + *_extra_recipient_limit ),
        qmgr_message_recipient_limit
    )

where the sum is over all used transports.

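As a quick sanity check of the first formula, here it is evaluated with what
I believe are the stock defaults (qmgr_message_recipient_minimum = 10,
qmgr_message_active_limit = 20000, default_recipient_limit = 20000,
default_extra_recipient_limit = 1000); substitute your own main.cf values as
needed:

    max( 10 * 20000 + 20000 + 1000, 20000 )
        = max( 221000, 20000 )
        = 221000 recipients in core per transport, at maximum
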
And this terribly complicated chapter concludes the documentation of the
nqmgr scheduler.

[By now you should theoretically know the nqmgr scheduler inside out. In
practice, you still hope that you will never have to really understand the
last chapter or two completely, and fortunately most people really won't.
Understanding how the scheduler works in ideal conditions is more than good
enough for the vast majority of users.]

CCrreeddiittss

 * Wietse Venema designed and implemented the initial queue manager with
   per-domain FIFO scheduling, and per-delivery +/-1 concurrency feedback.
 * Patrik Rak designed and implemented preemption where mail with fewer
   recipients can slip past mail with more recipients in a controlled manner,
   and wrote up its documentation.
 * Wietse Venema initiated a discussion with Patrik Rak and Victor Duchovni
   on alternatives for the +/-1 feedback scheduler's aggressive behavior.
   This is when K/N feedback was reviewed (N = concurrency). The discussion
   ended without a good solution for both negative feedback and dead site
   detection.
 * Victor Duchovni resumed work on concurrency feedback in the context of
   concurrency-limited servers.
 * Wietse Venema then re-designed the concurrency scheduler in terms of the
   simplest possible concepts: less-than-1 concurrency feedback per delivery,
   forward and reverse concurrency feedback hysteresis, and pseudo-cohort
   failure. At this same time, concurrency feedback was separated from dead
   site detection.
 * These simplifications, and their modular implementation, helped to develop
   further insights into the different roles that positive and negative
   concurrency feedback play, and helped to identify some worst-case
   scenarios.