Dynamic writeback throttling [LWN.net]


Dynamic writeback throttling

By Jonathan Corbet
September 15, 2010

Writeback is the process of writing dirty memory pages (i.e. those which have been modified by applications) back to persistent storage, saving the data and potentially freeing the pages for other use. System performance is heavily dependent on getting writeback right; poorly-done writeback can lead to poor I/O rates and extreme memory pressure. Over the last year, it has become increasingly clear that the Linux kernel is not doing writeback as well as it should; several developers have been putting time into improving the situation. The dynamic dirty throttling limits patch from Wu Fengguang demonstrates a new, relatively complex approach to making writeback better.

One of the key concepts behind writeback handling is that processes which are contributing the most to the problem should be the ones to suffer the most for it. In the kernel, this suffering is managed through a call to balance_dirty_pages(), which is meant to throttle a process's memory-dirtying behavior until the situation improves. That throttling is done in a straightforward way: the process is given a shovel and told to start digging. In other words, a process which has been tossed into balance_dirty_pages() is put to work finding dirty pages and arranging to have them written to disk. Once a certain number of pages have been cleaned, the process is allowed to get back to the vital task of creating more dirty pages.
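As a rough illustration of that "dig your way out" behavior, the toy program below models a task that is not allowed to continue until it has cleaned a target number of pages itself. This is not the kernel's actual balance_dirty_pages() code; every name and constant is invented for illustration.

    /* Toy model of the pre-patch behaviour described above; not kernel code. */
    #include <stdio.h>

    #define DIRTY_LIMIT  1000   /* pages: threshold where throttling kicks in */
    #define WRITE_CHUNK    16   /* pages the throttled task writes per pass   */
    #define CLEAN_TARGET   64   /* pages it must clean before continuing      */

    static long nr_dirty = 1200;            /* system-wide dirty page count */

    static long writeback_some_pages(long want)
    {
        long done = want < nr_dirty ? want : nr_dirty;
        nr_dirty -= done;                   /* pretend the I/O has completed */
        return done;
    }

    /* Old-style throttling: the dirtying task digs itself out. */
    static void balance_dirty_pages_old(void)
    {
        long cleaned = 0;

        while (nr_dirty > DIRTY_LIMIT && cleaned < CLEAN_TARGET)
            cleaned += writeback_some_pages(WRITE_CHUNK);

        printf("task cleaned %ld pages before being released\n", cleaned);
    }

    int main(void)
    {
        balance_dirty_pages_old();
        return 0;
    }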

There are some problems with cleaning pages in this way, many of which have been covered elsewhere. But one of the key ones is that it tends to produce seeky I/O traffic. When writeback is handled normally in the background, the kernel does its best to clean substantial numbers of pages of the same file at the same time. Since filesystems work hard to lay out file blocks contiguously whenever possible, writing all of a file's pages together should cause a relatively small number of head seeks, improving I/O bandwidth. As soon as balance_dirty_pages() gets into the act, though, the block layer is suddenly confronted with writeback from multiple sources; that can only lead to a seekier I/O pattern and reduced bandwidth. So, when the system is under memory pressure and very much needs optimal performance from its block devices, it goes into a mode which makes that performance worse.

Fengguang's 17-part patch makes a number of changes, starting with removing any direct writeback work from balance_dirty_pages(). Instead, the offending process simply goes to sleep for a while, secure in the knowledge that writeback is being handled by other parts of the system. That should lead to better I/O performance, but also to more predictable and controllable pauses for memory-intensive applications.

Much of the rest of the patch series is aimed at improving that pause calculation. It adds a new mechanism for estimating the actual bandwidth of each backing device - something the kernel does not have a good handle on, currently. Using that information, combined with the number of pages that the kernel would like to see written out before allowing a dirtying process to continue, a reasonable pause duration can be calculated. That pause is not allowed to exceed 200ms.
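The article does not give the patch's exact formula, but the basic arithmetic it describes is simple: divide the amount of writeback wanted by the estimated device bandwidth and clamp the result at 200ms. A minimal, self-contained sketch under those assumptions (the function and parameter names are made up here) might look like:

    /* Illustrative pause calculation along the lines described above;
     * the exact formula and constants are not the patch's code. */
    #include <stdio.h>

    #define MAX_PAUSE_MS 200

    /* bw_pages_per_sec: estimated backing-device bandwidth, in pages/second.
     * pages_to_write: pages the kernel wants written before the dirtier
     * may continue. */
    static unsigned int compute_pause_ms(unsigned long bw_pages_per_sec,
                                         unsigned long pages_to_write)
    {
        unsigned long pause_ms;

        if (bw_pages_per_sec == 0)
            return MAX_PAUSE_MS;          /* no estimate yet: be conservative */

        pause_ms = pages_to_write * 1000 / bw_pages_per_sec;
        return pause_ms > MAX_PAUSE_MS ? MAX_PAUSE_MS : pause_ms;
    }

    int main(void)
    {
        /* A ~100MB/s disk is roughly 25600 4KB pages per second; asking
         * for 1024 pages of writeback then yields a 40ms pause. */
        printf("pause = %ums\n", compute_pause_ms(25600, 1024));
        return 0;
    }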

The patch set tries to be smarter than that, though. 200ms is a long time to pause a process which is trying to get some work done. On the other hand, without a bit of care, it is also possible to pause processes for a very short period of time, which is bad for throughput. For this patch set, it was decided that optimal pauses would be between 10ms and 100ms. This range is achieved by maintaining a separate "nr_dirtied_pause" limit for every process; if the number of dirtied pages for that process is below the limit, it is not forced to pause. Any time that balance_dirty_pages() calculates a pause time of less than 10ms, the limit is raised; if the pause turns out to be over 100ms, instead, the limit is cut in half. The desired result is a pause within the selected range which tends quickly toward the 10ms end when memory pressure drops.
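A sketch of that adaptation logic appears below. The article only specifies the 10ms and 100ms boundaries and the halving of the limit after a long pause; the increase step used when pauses are too short is an invented placeholder, as are the structure and field names.

    /* Sketch of the per-task limit adaptation described above; only the
     * <10ms / >100ms rule and the halving come from the article. */
    #include <stdio.h>

    struct task_dirty_state {
        unsigned long nr_dirtied;        /* pages dirtied since the last pause */
        unsigned long nr_dirtied_pause;  /* dirty this many before pausing     */
    };

    static void adapt_pause_limit(struct task_dirty_state *t, unsigned int pause_ms)
    {
        if (pause_ms < 10)
            t->nr_dirtied_pause += t->nr_dirtied_pause / 4;  /* too short: dirty more next time */
        else if (pause_ms > 100)
            t->nr_dirtied_pause /= 2;                        /* too long: throttle sooner */
        t->nr_dirtied = 0;
    }

    int main(void)
    {
        struct task_dirty_state t = { .nr_dirtied = 0, .nr_dirtied_pause = 256 };

        adapt_pause_limit(&t, 5);    /* short pause: the limit grows */
        printf("after a 5ms pause:   limit = %lu\n", t.nr_dirtied_pause);
        adapt_pause_limit(&t, 150);  /* long pause: the limit is halved */
        printf("after a 150ms pause: limit = %lu\n", t.nr_dirtied_pause);
        return 0;
    }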

Another change made by this patch series is to try to come up with a global estimate of the memory pressure on the system. When normal memory scanning encounters dirty pages, the pressure estimate is increased. If, instead, the kswapd process on the most memory-stressed node in the system goes idle, then the estimate is decreased. This estimate is then used to adjust the throttling limits applied to processes; when the system is under heavy memory pressure, memory-dirtying processes will be put on hold sooner than they otherwise would be.
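The patch's actual bookkeeping is not spelled out here; the toy version below only captures the shape of the feedback loop (count dirty pages seen by reclaim, decay the estimate when kswapd goes idle, and tighten the per-task limit as the estimate rises), with made-up scaling factors throughout.

    /* A toy version of the pressure signal described above; the adjustment
     * factors are placeholders, not the patch's values. */
    #include <stdio.h>

    static unsigned long pressure;          /* 0 = relaxed, larger = more stressed */

    static void scanned_dirty_page(void)    /* reclaim ran into a dirty page */
    {
        pressure += 1;
    }

    static void kswapd_went_idle(void)      /* most-stressed node's kswapd is idle */
    {
        pressure -= pressure / 8;           /* decay the estimate */
    }

    /* Throttle sooner when the estimate is high, e.g. by shrinking the
     * per-task nr_dirtied_pause limit. */
    static unsigned long scaled_pause_limit(unsigned long base_limit)
    {
        return base_limit / (1 + pressure / 16);
    }

    int main(void)
    {
        for (int i = 0; i < 32; i++)
            scanned_dirty_page();
        printf("limit under pressure: %lu\n", scaled_pause_limit(256));
        kswapd_went_idle();
        printf("after kswapd idles:   %lu\n", scaled_pause_limit(256));
        return 0;
    }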

There is one other important change made in this patch set. Filesystem developers have been complaining for a while that the core memory management code tells them to write back too little memory at a time. On a fast device, overly small writeback requests will fail to keep the device busy, resulting in suboptimal performance. So some filesystems (xfs and ext4) actually ignore the amount of requested writeback; they will write back many more pages than they were asked to do. That can improve performance, but it is not without its problems; in particular, sending massive write operations to slow devices can stall the system for unacceptably long times.

Once this patch set is in place, there's a better way to calculate the best writeback size. The system now knows what kind of bandwidth it can expect from each device; using that information, it can size its requests to keep the device busy for one second at a time. Throttling limits are also based on this one-second number; if there are not enough dirty pages in the system for one second of I/O activity, the backing device is probably not being used to its full capacity and the number of dirty pages should be allowed to increase. In summary: the bandwidth estimation allows the kernel to scale dirty limits and I/O sizes to make the best use of all of the devices in the system, regardless of any specific device's performance characteristics.
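A back-of-the-envelope version of that sizing rule, using assumed bandwidth figures, shows how the request size tracks the device:

    /* Worked example of the "one second of I/O" sizing described above;
     * the bandwidth numbers are illustrative assumptions. */
    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* Size a writeback request to keep the device busy for about a second. */
    static unsigned long writeback_chunk_pages(unsigned long bw_bytes_per_sec)
    {
        return bw_bytes_per_sec / PAGE_SIZE;
    }

    int main(void)
    {
        /* A ~100MB/s disk gets ~25600-page (100MB) requests, while a slow
         * 5MB/s USB stick gets ~1280 pages (5MB), so it cannot stall the
         * system with an oversized write. */
        printf("fast disk: %lu pages\n", writeback_chunk_pages(100UL << 20));
        printf("usb stick: %lu pages\n", writeback_chunk_pages(5UL << 20));
        return 0;
    }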

Getting this code into the mainline could take a while, though. It is a complicated set of changes to core code which is already complex; as such, it will be hard for others to review. There have been some concerns raised about the specifics of some of the heuristics. A large amount of performance testing will also be required to get this kind of change merged. So we may have to wait for a while yet, but better writeback should be coming eventually.



Dynamic writeback throttling

Posted Sep 16, 2010 8:05 UTC (Thu) by zmi (subscriber, #4829)

Sounds very nice. Has anyone had an eye on virtualized servers yet? If I run a SLES11 host which runs 20 Linux VMs, it would be interesting to have some kind of global memory/disk pressure mechanism. If one of the 20 VMs is under heavy pressure, a similar "pause the others" approach would be welcome.

Also, some kind of auto-configuration for VMs would be nice, for example using the NOOP I/O scheduler, as the guests can't optimize anyway; that must be done by the host and the RAID controller.

It seems to me developers still only/mostly look at single server or workstation performance, with a single user, and I wonder if VM behaviour couldn't be improved a lot.

Dynamic writeback throttling

Posted Sep 23, 2010 13:52 UTC (Thu) by i3839 (guest, #31386)

This seems a lot better than what we have now, but there still seems to be room for improvement.

Dirty throttling should be mostly independent of memory pressure. If you start throttling IO only when the system comes under memory pressure, the damage may already be done. Throttling should always happen when the rate of dirtying is greater than the rate of writeout. This automatically finds the best buffer size for that particular IO load and device. The tricky part is measuring the IO speed. If you do throttling per task, then you have to measure the IO speed per task too (the difference between the maximum, minimum, and average IO speeds is just too great).

A reason not to throttle would be to cache dirty memory in the hope that it will be rewritten or removed soon, so that less is written overall. Another reason is when something wants to use that just-written data immediately. And things like laptop mode might delay writes further. How much to cache does depend on memory pressure. Extra caching should be the exception, not the default algorithm.

Another concern is latency, mostly for unrelated read IOs.

For rotating disks it's most efficient to give them as many writes as possible, to fill up their write buffer and reduce the seek cost. Even then 100MB is a tad excessive though, especially in a system with many disks.

SSDs need a lot less write data to keep them saturated. Even the most crappy ones should be close to maximum throughput with a couple of MBs outstanding. More importantly, this figure is independent of the speed of the SSD; faster SSDs won't need more data. So the "one second of work" rule of thumb is a bit flawed.

Also the effective disk throughput depends on how many read IOs happen at the same time, so I think something more dynamic is needed than a handful of arbitrary thresholds.

All in all this is a big step in the right direction, so I hope it gets merged soon.