Saturday, February 20, 2016

Breaking KASRL with micro architecture Part 1

Hack in the box & stuff

Skip to Introduction if you are here for the techstuff. I submitted a CFP to “Hack In the Box” in Amsterdam end of may and the talk has been accepted. So I’ll be spending part of my vacation this year presenting some of my research on cache side channel attacks from the last year or so. I’m really excited because I think I actually have some really nice stuff to present – not only is it technically interesting but I think it’s far more relevant that the common perception in the infosec community.  If you have a chance I think it’ll be worth while going. Amongst those who is going to present is Jacob Torrey whose material I covered twice on my blog and who in my opinion really is one of those who is pushing the boundaries of info sec. Also Felix Wilhelm is going and I currently have his master thesis on my night table and thus far it is indeed very cool reading.

Preparing a talk obviously takes a lot of time and since I’m an amateur this is going to cut down on the time I can spend on this blog. I have two more posts in the pipeline (including part 2 of this post), but no idea when I can get them done really.

Finally posting this blog post has been a big dilemma for me. I had the basic idea almost a year ago and I’ve been putting it off again and again because it’s offensive and posting it is a moral borderline case for me. I decided to go ahead but with a heavy heart. Also I have a heavy heart because when I started this blog I wanted to target the ambitious newbie. However there has been a tendency that my work has become progressively less accessible and this particular post is probably the least accessible thus far.


KASRL means “kernel address space randomized layout”. It is a feature of most modern operating systems meant to protect against privilege escalation attacks that utilize code reuse attacks such as ROP (return oriented programming). The idea is that if an attacker doesn’t know where the code is, he cannot setup a ROP chain to reuse the code. Therefore a modern operating system load driver on different addresses on each boot and this leaves an attacker in the dark about where the code is located. Code reuse attacks has grown in importance for privilege escalations attacks with requirements for driver signing and wide spread use of “no-execute” settings on stack and other data in kernel mode, thus forcing the attacker to either bypass driver signing to load a driver, attack page tables or reuse existing code.

The attacker has typically relied on information leaks in known drivers that give away addresses that can be used to locate code. However last year I had an idea how we might use the CPU design for our information leak and this will be the topic of this blog post. That we can use CPU design against KARSL is particularly bad news because CPU’s are not subject to updates as software are and in the past intel has not been forthcoming with security fixes in this respect. Also relying on CPU for an attack means that mean thing generalizes across operating systems – this means many of the findings I’ve made here for Windows 7 will lend themselves to other platforms. On the other hand these kind of attacks may be applicable only on a subset of CPU’s. I think that all modern Intel CPU with a 3rd level cache i3,i5,i7 and xeons from Sandy Bridge and up is likely to be affected and possibly other CPU’s as well.

Finally I’d like to say these kind of attacks has been made before but I’m not aware of any with the same methodology as mine.

Finding drivers

The idea evolved around the “Prefetch” instructions. These instructions are meant for software to give the prefetcher a hint to load a specific address into a certain cache level because the software will need it shortly. This can be used to significantly improve performance as the prefetcher runs independent of the CPU and thus memory is fetched before it’ll be used. There is however a interesting thing with the prefetch instructions that caught my eye: They don’t cause exceptions (except they are incompatible with a lock prefix)[2] Intel. However they take a virtual address as input and will need a physical address for loading memory and placing it in the cache. Thus they may entirely by pass memory protection mechanisms further that seem relatively plausible because since they aren’t actually reading data they should not in any way or fashion allow an attacker to read memory he isn’t supposed to. Consequently intel might have decided that there wasn’t a security issue.

My first idea was to use the prefetch instruction in a modified prime+probe or a evict+time attack. That is make a guess about an address in a driver, then use prefetch instruction to load the cache entry and either evict+time or prime+probe to tell if you actual hit the address. This of cause is pretty slow and there is a lot of memory you’d need to run this on before you’d actually find your way around the kernel address space. I’ll get back to this in another post.

The second idea is what if the prefetch instructions leak information through it’s execution time? I thought of 3 scenarios:

1)      Prefetch does not leak information (e.g. the DS descriptor has a limit, prefetcher doesn’t do anything on kernel memory when called from R0, is designed as  constant time etc.)
2)      Prefetch is slower on virtual addresses which actually maps to a page. One reason could be because telling the prefetcher to load the cache takes time
3)      Prefetch is slower on addresses which is not does not map to a page because searching all page table entries for an address is slower than just searching till you hit something.

Well. First I dug into the descriptor issue mentioned in 1) because that’d be a no go and since I recall that 32bit XP actually has a limit on it’s DS descriptor making it useless to access kernel mode memory. It seems though that this is not the case in 64 bit mode as 5.3.1 of the [1] Intel clearly says that limits are not checked at run time. (I think there are some exceptions for FS/GS) So time to load up a debugger and write some code. First I loaded live KD and dumped a map of the drivers using the “lm n t” command. Then I wrote some user mode code that times the prefetcht2 instruction. I picked the prefetcht2 instruction because I assumed loading the largest latency cache was most likely to be actually carried out by the prefetcher and thus most likely to be successful. I then wrote up code to time prefetching an address 20 times and pick the lowest timing found. My argumentation here is that there is noise in memory sub system (hence the multiple accesses to try and cancel it out) and the lowest value because searching for the lower bound would tell me more than an average. Then I picked two addresses to time in this way. One a few thousand pages below the first loaded driver and one in the first loaded driver. Finally I repeated the experiment a 1000 timers. And voila the prefetcht2 instruction is faster on a virtual address that is actually mapped. I speculate that this is because searching through PTE’s is aborted early and thus faster. I think I know how to figure out if this is true and I may get back to it in another blog post. So I moved on and wrote up a small program that’d scan a small portion of the kernel memory address space and tell me if a page was mapped or not and compared it to my module list. It worked beyond any expectation.

So how fast can I scan? It takes about 112 CLK’s for a miss and 85 CLK’s for hit on my wife’s sandy bridge computer. An approximate value (with 20 timings per address) including code over head is that I run around 500k probes per second. With a 64bit address space that too is impractical to scan the entire address space with. However first we should notice that modern operating systems are paging and thus we only need to probe once per page. A page is usually 4096 bytes large. Also on windows 7 it appears that the 16 upper bits are all set for drivers further reducing the problem. Finally Windows 7 loads drivers continuously – on my computer around 100 mega bytes of them – and thus I only have to run through a 48 bit address space in 100 megabyte steps to find where the drivers are hidden. This reduces the problem to a practical 6 second problem. Now I do get a few false positives on this scan. Other stuff than drivers maybe located in this space and I may have some tuning problems and additional probing could further reduce these. But I consistently find the drivers and only a few false positives.

What was a blessing above – that the drivers are clustered into one big chuck of pages – is a curse going forward. The problem is that we do not get the size of the individual drivers and thus is unable to tell which is which for a gadget search. However I believe we can solve this problem with modified versions of evict+time and prime+probe attacks. This however will have to wait for part 2 of this blog post.


I need to express my gratitude to Nishat Herath of Qualys for discussions on this topic I had with him.


[1] Intel: “Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 3A: System Programming Guide, Part 1”.
[2]Intel: “Intel® 64 and IA-32 Architectures.Software Developer’s Manual.Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

No comments:

Post a Comment