What Stone dream about: March 2016

My ramblings

No technical stuff in here. If you are here for the tech, skip to the introduction.

First up, part 2 of the KASLR blog is in progress. It might take a while because there is a bit more meat on that bone than I’d expected and I want to do it right. But for now I hope you’ll enjoy this more speculative row hammer related post.

My all time favorite paper is [1] “Factors affecting voluntary alcohol consumption in the albino rat” by Kalervo Eriksson. I haven’t read it. But everything about the title fascinates me. Why albino rats? What factors? Why does it intrigue me? Alcoholic rats? Binge drinking or occasional relaxation? Beer or whiskey? What is an albino rat hangover like? And most of all, what was the motivation of the author? I shall not know anytime soon because I enjoy pondering these questions too much.

The second on my list is one I’ve actually read: [2] “Lessons from the Bell Curve” by James Heckman. I know James Heckman’s stuff pretty well because I spent a few years studying micro-econometrics, and he is an amazing econometrician who was awarded the Nobel Prize for his work in this area. But this paper stood out to me. It’s not on economics (at least not directly). It was not a particularly original paper, since it took its base in a controversial book. The book, “The Bell Curve”, was about the influence and distribution of intelligence. The criticism this book received (and it was a lot) was mostly politically correct dissent that did not engage with the evidence based research presented in the book. Heckman took the same data and systematically took it apart with state-of-the-art econometric methods, dismissing some conclusions and confirming others with no regard to the political correctness of his results. In my mind it is too rare that this kind of paper is published, yet it is very much at the core of science to check the robustness of other people’s results and to apply new methods to old data. That paper is my inspiration when I critique other people’s work in this blog.

Introduction

The subject of this blog post is a paper by [3] Aweke et al. called “ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks” (it can be found here: http://iss.oy.ne.ro/). In it they first develop a cache eviction based row hammer attack to supplement the classic clflush approach, much along the lines of the rowhammer.js attack. Then they develop a mitigation method for the row hammer attack, and I’ll spend some time looking into that work below. They test this method against the classic row hammer attack using clflush and against the cache eviction based attack. In this blog post I’ll first briefly summarize the paper, then comment on it, and finally look into how the mitigation may be bypassed by next generation row hammer attacks. I shall not develop these next generation attacks in full, so it is somewhat speculative whether they would actually work, but I hope to point fingers in directions where row hammer could go in response to mitigations such as the one in this paper. There are a number of reasons for this. First, I don’t actually have access to their mitigation and thus cannot test against it. Second, I have to borrow a computer to flip bits, as neither my computer nor my wife’s seems to get bit flips when row hammering. Third, I am somewhat reluctant to fully develop new attacks for which there currently is no protection. Finally, and most importantly, I’m a blogger and I’m doing this stuff after hours between all my other hobbies and obligations – it’s simply not within my time budget.

Despite any critique in here, I should make it very clear beforehand that I really like the paper and think it is an important contribution. If it wasn’t, I wouldn’t spend time on it. The method certainly passes the litmus test of raising attacker cost, and we must always remember that perfection is the enemy of good. Unfortunately this blog post turned out a bit more technical than I would’ve liked. To really engage with it you need some knowledge of row hammer, performance counters and the cache subsystem. Regular readers of my blog should be ok.

Ultra short summary of the Anvil paper

The paper’s results are:

1. Bit flips can be caused in 15ms.

2. Because of 1, the suggested mitigation of doubling the refresh rate on D-Ram, from a refresh every 64ms to one every 32ms, is not a sufficient protection.

3. Row hammer can be done without clflush. [5] Seaborn & Dullien suggested limiting access to clflush to kernel mode, and Google removed support for the clflush instruction in the NaCl sandbox. The authors show that this is not a viable solution because one can cause cache evictions and thereby bypass the cache to access physical memory fast enough to cause row hammering.

4. Then they go on to develop a software mitigation: Anvil. Anvil works in 3 stages:

a. Stage 1: Using the fact that high rates of cache misses in the last level cache are rare on modern computers, they use the LLC cache miss performance counter as a first check for whether an attacker is row hammering. If LLC cache misses exceed a threshold per time unit, Anvil continues to the second stage; if not, it just continues monitoring. In addition to monitoring LLC cache misses, it also monitors the MEM_LOAD_UOPS_RETIRED_LLC_MISS count; this value gets used in the second stage.

b. Stage 2: In the second stage Anvil uses the PEBS performance events to sample loads and/or stores. These performance counters can sample loads/stores above a given latency. The authors set this latency to match that of a cache miss. The advantage of using these performance counters is that they provide the load/store address. This allows the authors to identify whether the same rows are being used again and again, or whether the access pattern is spread all across memory and thus benign. If it’s not benign they proceed to stage 3. To cut down on the overhead of sampling stores with PEBS, they only sample stores if the MEM_LOAD_UOPS_RETIRED_LLC_MISS counter from the first stage was significantly smaller than the LLC misses. The thought here is that if they are seeing only loads in the first stage, a potential attack is not driven by stores.

c. Stage 3: Anvil has at this point detected a row hammer attack and will now thwart it. Because reading a row will automatically refresh it, Anvil uses the addresses found in the second stage to figure out which rows are being hammered and then issues a read to the neighboring rows.

5. The analysis of Anvil finds that it comes at a low performance cost: the first stage is cheap performance wise, and it’s rare that Anvil proceeds to stage 2.
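
To make the three stages in item 4 concrete, here is a minimal Python sketch of the decision logic. It is not the authors’ implementation. The thresholds, the 8 KB row size and the linear address-to-row mapping are assumptions of mine for illustration (real D-Ram addressing also involves channels, ranks and banks), and the function names are hypothetical.

```python
from collections import Counter

# All constants are illustrative assumptions, not values from the paper.
LLC_MISS_THRESHOLD = 20_000   # stage 1: LLC misses allowed per sampling window
ROW_SHARE_THRESHOLD = 0.4     # stage 2: share of samples that makes a row "hot"
ROW_SIZE = 8 * 1024           # assumed bytes per row, with a simplified linear mapping


def row_of(addr):
    """Map a physical address to a row index under the simplified mapping."""
    return addr // ROW_SIZE


def stage1_suspicious(llc_misses):
    """Stage 1: cheap check of the LLC miss counter for one window."""
    return llc_misses > LLC_MISS_THRESHOLD


def stage2_hot_rows(sampled_addrs):
    """Stage 2: rows that dominate the sampled high-latency accesses."""
    if not sampled_addrs:
        return []
    counts = Counter(row_of(a) for a in sampled_addrs)
    total = len(sampled_addrs)
    return sorted(r for r, c in counts.items() if c / total >= ROW_SHARE_THRESHOLD)


def stage3_rows_to_refresh(hot_rows):
    """Stage 3: neighbors of the hammered rows, which a mitigation would read."""
    victims = set()
    for r in hot_rows:
        victims.update({r - 1, r + 1})
    return sorted(v for v in victims if v >= 0)


def anvil_window(llc_misses, sampled_addrs):
    """Run one monitoring window through all three stages."""
    if not stage1_suspicious(llc_misses):
        return []
    return stage3_rows_to_refresh(stage2_hot_rows(sampled_addrs))


# Double-sided hammering of rows 10 and 12 puts rows 9, 11 and 13 on the refresh list.
hammer = [10 * ROW_SIZE, 12 * ROW_SIZE] * 50
print(anvil_window(50_000, hammer))   # [9, 11, 13]
```

An access pattern spread evenly over a hundred rows passes stage 1 in this sketch but yields no hot rows in stage 2, which is the benign case described above.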

The short commentary

Regular readers of this blog will not be surprised by most of what is found in the Anvil paper.

1. The 15ms the authors take to cause a bit flip seems to be a bit high. [4] Kim et al. state that a refresh interval below 8.2ms is required to entirely kill row hammering, which suggests that bit flips under “favorable” circumstances can be caused significantly faster.

2. It follows logically from 1 that shortening the refresh interval to 32ms does not completely deal with row hammer. Nishat Herath of Qualys and I pointed this out at Black Hat, and our slides and further comments can be found in my “Speaking at black hat” post.

3. That bit flips can be caused without clflush is also not a surprise. It confirms [6] Gruss, Maurice & Mangard, who went a step further, generated optimal cache eviction strategies and implemented a JavaScript based attack.

4. Anvil:

a. Stage 1: This is exactly the same approach that I took in my first two blog posts on row hammer detection and mitigation, which formed the basis for Nishat’s and my talk at Black Hat. All came to the result that there is a high correlation between this perf counter and row hammer. [7] Gruss, Maurice & Wagner took it one step further and adjusted for overall memory activity, and they too got pretty good results. I noticed false positives in my work with this method, and this result is reproduced by the Anvil authors.

b. Stage 2: Here we have the first big new contribution of the paper, and it’s a great one. It certainly opens up new ways to think about cache attack detection and mitigations. It is a natural connection between cache misses and actual row hammer because you get addresses. I shall examine this method in more detail below.

c. Stage 3: The method of mitigation by reading neighboring rows was suggested by Nishat and myself at our Black Hat talk. We only confirmed that reading victim rows would protect against row hammer; we didn’t actually use this in our mitigation work and instead suggested causing a slow down when too many cache misses occurred. The reason is that we saw no good way of actually inspecting which addresses were being used. I must honestly say that I missed PEBS while designing this mitigation and found no good way of connecting a cache miss to an address. I shall analyze below to what extent Anvil succeeds at this. We suggested the slow down as a “soft response” in light of the rare false positives in a.

5. The analysis of the performance penalty of Anvil follows relatively closely the results I had testing our mitigation system from Black Hat. I used an entirely different set of test applications and came to the same conclusions. Thus I can confirm that Anvil is likely to have a very low performance impact that should be acceptable for most systems. The first stage is nearly free, and benign applications that get misdetected in the first stage are rare. I should note here that Aweke et al. use h264ref to test Anvil and find that this video encoder rarely triggers the first stage. However, h264ref is not representative of video encoders, for multiple reasons. In my testing I had problems with an h264 video encoder (that I’ve coauthored) generating too many cache misses a great many times. h264ref is written to be portable and easily understood. Production video encoders are much more streamlined in their memory access, and particularly the heavy use of streaming instructions makes actual memory speed the limiting factor in many operations – especially motion search. Modern video encoders also utilize threading much more, and that puts the L3 cache under much more pressure than h264ref will. Also, to evaluate a video encoder’s impact on memory access it’s important to use the right input material, because video encoders tend to shortcut additional calculation if they find something they estimate to be “good enough”. Nevertheless, this is probably mostly a personal comment – the authors’ conclusions generally match my results.

Anvil and future row hammer attacks

Aweke et al. conclude: “We feel that these results show it is viable to protect current and future systems against row hammer attacks”. As always it’s hard to tell what’s in the future, but my understanding of next-generation attacks is that they adapt to mitigation methods. In this section I shall outline what I think are weaknesses in Anvil that attacks could exploit in the future. Unfortunately I end up being a lot more pessimistic than the authors. Even if the authors are right, I’ll still prefer a hardware solution such as pTRR or PARA. The reason for this is that I’m a strong believer in fixing problems at their root cause. The cause of row hammer is a micro architectural problem and it should be fixed in micro architecture. Also we should consider that D-Ram is used in other devices than modern x86 computers. That said, until we see such a solution, software might be the only solution, and Anvil or other solutions certainly are options to fill the void.

My first source of pessimism is my work on the cache-coherency way of activating rows. I haven’t actually tested it, but I think it may bypass the LLC cache miss counter, as the entry used for hammering never actually leaves the cache. See my blog post on MESI and rowhammer for more details. Now, I don’t know if that method is fast enough for row hammer, but if it is, it may affect all suggested software solutions thus far, because all three hinge on the LLC_MISS counter. Should this attack actually be fast enough, and should it indeed avoid LLC cache misses, it is probably not a fatal problem. There are performance counters that cover this kind of stuff, it’s a rare event in benign software, and thus we should be able to work around it. These counters could be implemented as a parallel option in stage 1 of Anvil.

Aweke et al. suggest using a 6ms sampling interval for stage 1 and a 6ms sampling interval for stage 2. This means they spend 12ms detecting an attack before they respond. With [4] suggesting that a refresh interval of 8.2ms is required to rule out row hammer attacks, this might actually give an attacker enough time to flip a bit. Aweke et al. suggest that 110k activations are required to cause bit flips; if we multiply this number by the 55ns activation interval reported by [4], we end up with a lower bound on row hammering of only ~6ms. However, the authors also evaluate a version of Anvil with 2ms intervals, which they call Anvil-Heavy, and conclude this version would work as well, despite slightly more overhead. It would seem that Anvil-Heavy is the more relevant version of Anvil if the row hammer mitigation is supposed to be “perfect”. It is conceivable that the crosstalk effects behind row hammer will become more serious with time as memory gets denser and faster. How much wiggle room Anvil has to compensate beyond Anvil-Heavy is in my opinion an open question. Two factors are in play: how much overhead lowering the intervals below 2ms will cause, through false predictions in stage 1 and more sampling in stage 2, and the fact that the sampling interval in stage 2 can only be shortened so much before too few samples are collected to determine the row locality of the hits. All this said, I think the article underestimates the performance cost of a truly secure implementation, but that cost might still be acceptable.
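
The arithmetic behind that ~6ms lower bound is easy to check. The small Python sketch below uses only the figures quoted above. The one assumption I add is that Anvil-Heavy uses 2ms for both stages, so that its total detection delay is 4ms. The function names are mine.

```python
ACTIVATIONS_NEEDED = 110_000   # activations to cause a bit flip, per Aweke et al.
ACTIVATION_INTERVAL_NS = 55    # activation interval reported by [4]


def min_hammer_time_ms(activations=ACTIVATIONS_NEEDED,
                       interval_ns=ACTIVATION_INTERVAL_NS):
    """Lower bound on the time needed to issue enough activations, in ms."""
    return activations * interval_ns / 1_000_000


def responds_in_time(stage1_ms, stage2_ms):
    """True if both detection stages finish before the lower bound is reached."""
    return stage1_ms + stage2_ms < min_hammer_time_ms()


print(min_hammer_time_ms())      # 6.05
print(responds_in_time(6, 6))    # False: 12ms of detection is too slow
print(responds_in_time(2, 2))    # True: 4ms, assuming 2ms per stage
```

Under these numbers the default 6ms + 6ms configuration responds roughly twice as late as the theoretical minimum attack time, while a 2ms + 2ms configuration leaves about 2ms of slack.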

It should be noted that there is a good chance an attacker can gain a bit of introspection into Anvil. My guess is that if the attacker monitors latency in (say) a classic clflush row hammer attack, he’ll be able to see when Anvil switches to the second stage. The argument is that sampling with PEBS causes latency on the instructions being sampled – typically interrupting the process would be in the order of magnitude of 4000 CLK’s including only rudimentary handling, which is much less than Anvil actually has to do. There is a reason Aweke et al. see performance penalties in the second stage. It is important to note that with latencies in this order of magnitude, anomalies can be detected by the attacker at very little cost, thus hardly disturbing the attack itself.
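
As a rough illustration of how cheap such anomaly detection is, the following Python sketch flags latency samples that sit about one interrupt above the typical value. It works on a plain list of numbers. The 4000 CLK figure is the order of magnitude mentioned above, the sample values are made up, and the function name is hypothetical.

```python
from statistics import median

INTERRUPT_COST_CLK = 4000   # order of magnitude from the text, not a measurement


def find_latency_spikes(samples_clk, spike_clk=INTERRUPT_COST_CLK):
    """Return indices of samples that exceed the median latency by spike_clk or more."""
    baseline = median(samples_clk)
    return [i for i, s in enumerate(samples_clk) if s - baseline >= spike_clk]


# Made-up trace: steady ~300 CLK accesses with one sampled (interrupted) access.
trace = [300] * 10 + [4500] + [300] * 5
print(find_latency_spikes(trace))   # [10]
```

The point is only that a median and a comparison per access suffice, so the check adds next to nothing to the loop doing the hammering.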

An example of how this could be used to bypass Anvil hinges on an implementation detail. Anvil does not sample store operations in the 2nd stage if store operations were rare in the first stage. This leaves room for an attacker to switch from a load based attack to a store based attack mid attack and thus outwit stage 2. This again isn’t a fatal flaw by any means and can be worked around by simply always sampling stores – and the overhead should be acceptable given that it’s rare in real life that the 2nd stage even engages. But again, the paper is probably underestimating the performance cost required for “perfect protection”. However, I don’t think this example is the real issue. The issue is that an attacker can adapt to the defense.

The last issue with Anvil that I’ll cover in this blog post is that Anvil assumes that a high latency instruction has high latency because of the loads and stores it does. While this holds true for traditional attacks, it is not a given. The cache subsystem and the hardware prefetchers are examples of off core subsystems where access directly to D-Ram can originate without being part of loads and stores initiated by an instruction in the core. Here is an example of how PEBS can be tricked. I’ll keep it simple by only accessing memory from one row, but it should be clear how a second row could be added to do row hammering.

1. Let A be a cache set aligned address in an aggressor row.
2. Let E be an eviction set for A. E consists of E1..EN, where N is the number of ways and none of E1..EN is in the same DRAM bank as A.
3. Prime A’s cache set with E.
4. Use clflush to remove one line of E from the set, thus creating a way in the set marked as Invalid.
5. Use a store operation (mov [A],1) to set the invalid cache way to Modified, containing A.
6. Now evict A using E, causing a writeback of A.
7. Repeat from 4.

As for step 2, this is easily done – there are plenty of addresses mapping to the same cache set as A to pick from. Step 3 is standard stuff. In step 4 we may get high latency, but even if clflush counts as a store operation (which it may or may not), it does not use A, and thus an irrelevant address would be recorded by PEBS. Also, the latency of a clflush is around 115 CLK’s (on my wife’s Sandy Bridge), significantly below that of a cache miss. Further, it might not actually be needed. In step 5, a 4 byte store operation does not load the rest of the cache line from D-RAM (at least on my wife’s Sandy Bridge), thus the latency is low and will not be recorded by Anvil’s PEBS. In step 6 we do get latency and we cause a write back of A, but PEBS will record addresses from E, which are irrelevant. Such a scheme may be too slow for actual row hammering, but I don’t think it is. After all, the normal eviction based attack is 70% faster than the requirement of 110k activations in a 64ms refresh interval according to [3]. Even if this turns out to be too slow for row hammering, it demonstrates that the second stage of Anvil may have severe deficits.
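
The bookkeeping in these steps can be followed in a toy Python model of a single cache set. This is a simulation only, and it bakes in several assumptions that real hardware may not honor: true LRU replacement, a store miss that allocates the line without fetching it (the behavior described for step 5), and a clflush that stays below the sampling latency threshold. The class and function names are hypothetical. The model counts which addresses reach D-Ram and which addresses a latency based sampler would get to see.

```python
from collections import Counter, OrderedDict


class ToyCacheSet:
    """One cache set with true LRU replacement (an assumption; real LLCs differ)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()      # addr -> dirty flag, ordered LRU to MRU
        self.dram_accesses = Counter()  # addr -> line fills plus writebacks
        self.sampled = []               # addresses a latency based sampler would record

    def _make_room(self):
        if len(self.lines) >= self.ways:
            victim, dirty = self.lines.popitem(last=False)
            if dirty:
                self.dram_accesses[victim] += 1   # writeback reaches D-Ram

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)          # hit: low latency, not sampled
            return
        self._make_room()
        self.dram_accesses[addr] += 1             # line fill from D-Ram
        self.sampled.append(addr)                 # high latency: the sampler sees it
        self.lines[addr] = False

    def store(self, addr):
        # Assumption from the text: a store miss does not fetch the line,
        # so it is fast and never sampled.
        if addr not in self.lines:
            self._make_room()
        self.lines[addr] = True
        self.lines.move_to_end(addr)

    def clflush(self, addr):
        # Assumed to stay below the sampling latency threshold.
        if self.lines.pop(addr, False):
            self.dram_accesses[addr] += 1


def run_pattern(ways=4, rounds=1000):
    cache = ToyCacheSet(ways)
    evict = [f"E{i}" for i in range(ways)]
    for e in evict:                  # step 3: prime the set
        cache.load(e)
    for _ in range(rounds):
        cache.clflush(evict[0])      # step 4: free one way
        cache.store("A")             # step 5: A enters as Modified
        for e in evict:              # step 6: evict A, forcing a writeback
            cache.load(e)
    return cache


cache = run_pattern()
print(cache.dram_accesses["A"], "A" in cache.sampled)   # 1000 False
```

In this model A is written back to D-Ram once per round, yet every sampled address belongs to E. Whether real replacement policies and store behavior cooperate is exactly the part that would need testing on hardware.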

Finally we can use the classic clflush method during the 1st phase of Anvil as noted above.

I can come up with other scenarios where this (and in some cases the other) software row hammer mitigation may fail, but I think I’ve placed enough on the table for this blog post.

Conclusion

The current implementation of Anvil is a low overhead row hammer mitigation which will work well against attacks that are not engineered to bypass it. Should Anvil become widespread, it is likely that next generation methods of row hammering exist that are capable of bypassing the second stage of Anvil. Thus, if I were to choose a method for row hammer mitigation on a mission critical system, I would go with the suggestion made by [7] of triggering a simple slow down in the event of a detection. It has the added benefit of thwarting some cache side channel attacks in the process. While this has a much higher performance penalty on a few applications, it’ll run at the same performance cost as Anvil in most real world scenarios, and its simplicity offers less attack surface for engineered attacks.
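
To give a feel for why a plain slow down can be enough, here is a small Python model of it. Every number in it is an assumption for illustration except the 64ms refresh interval and the 110k activation requirement quoted earlier: the 2ms monitoring window, the threshold, the 30ms stall and the miss rates are all made up, and the function name is hypothetical. It counts how many cache misses a process can issue within one refresh interval if it is stalled whenever a window exceeds the threshold.

```python
REFRESH_INTERVAL_MS = 64
ACTIVATIONS_NEEDED = 110_000   # requirement quoted from Aweke et al.


def activations_per_refresh(miss_rate_per_ms, window_ms, threshold_per_window, stall_ms):
    """Misses a process can issue in one refresh interval under throttling."""
    t = 0.0
    total = 0
    while t < REFRESH_INTERVAL_MS:
        run = min(window_ms, REFRESH_INTERVAL_MS - t)
        misses = int(miss_rate_per_ms * run)
        total += misses
        t += run
        if misses > threshold_per_window:
            t += stall_ms          # soft response: stall the offending process
    return total


# An attacker issuing 18,000 misses per ms, with and without a 30ms stall.
print(activations_per_refresh(18_000, 2, 20_000, 30))   # 72000
print(activations_per_refresh(18_000, 2, 20_000, 0))    # 1152000
# A benign process at 1,000 misses per ms never trips the threshold.
print(activations_per_refresh(1_000, 2, 20_000, 30))    # 64000
```

With these made-up parameters the throttled attacker stays below the 110k activations needed within a refresh interval, and the benign process is never stalled. The response needs no knowledge of which addresses were touched, which is where the smaller attack surface comes from.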

Literature

[1] Kalervo Eriksson: “Factors affecting voluntary alcohol consumption in the albino rat”, Annales Zoologici Fennici, Vol. 6, No. 3 (1969), pp. 227-265

[2] James J. Heckman: “Lessons from the Bell Curve”, Journal of Political Economy, Vol. 103, No. 5 (Oct., 1995), pp. 1091-1120

[3] Zelalem Birhanu Aweke, Salessawi Ferede Yitbarek, Rui Qiao, Reetuparna Das, Matthew Hicks, Yossi Oren, Todd Austin: “ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks”

[4] Yoongu Kim, R. Daly, J. Kim, C. Fallin, Ji Hye Lee, Donghyuk Lee, C. Wilkerson, K. Lai, and O. Mutlu: “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors”, in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 361–372, June 2014

[5] Mark Seaborn and Thomas Dullien: “Exploiting the DRAM rowhammer bug to gain kernel privileges”, March 2015

[6] D. Gruss, C. Maurice, and S. Mangard: “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript”, ArXiv e-prints, July 2015

[7] Gruss, Maurice and Wagner: “Flush+Flush: A Stealthier Last-Level Cache Attack”, http://arxiv.org/abs/1511.04594

Wednesday, March 9, 2016

Anvil & next generation row hammer attacks