Following the previous post, I will now give an overview of the analysis process for exploits within Office documents. With PDFs you have one format and one reader (e.g. Adobe Reader); with Office you have many accepted formats and one reader. For instance, Word 2013 is capable of handling the following formats:
- PostScript, and PDFs since they rely on a subset of the Postscript language
- Open XML
The story is no different for other Office components (e.g. Excel). Word tends to be heavily exploited through crafted RTF files. I will address RTF files separately since they are a whole different mess. For now I will focus on a generalised approach to analysing exploits that use the other supported formats.
The Tools and the Approach
As with PDF documents, the tough part is extracting the shellcode and understanding its functionality. While with PDF files you have JS doing the heap spray, with Office documents there is no scripted heap spray. Instead, attackers typically embed the shellcode somewhere within the document. In the OpenXML format, some exploit documents carry large bin files containing the shellcode within the compressed archive, which increases the chances of hitting the shellcode when the exploit fires. Bear in mind that embedded content may still perform heap spraying (e.g. Flash, through compiled ActionScript).
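As a starting point for the OpenXML case, the archive can be listed to spot unusually large bin parts. The following is a minimal sketch, not a detection rule: the function name and the 64 KiB threshold are my own assumptions, and a large bin part is only a hint that deserves a closer look.

```python
import zipfile

def large_binary_parts(source, min_size=64 * 1024):
    """Return (name, size) for .bin members of an OpenXML archive.

    `source` is a path or a file-like object. `min_size` is an arbitrary
    threshold (assumption), tune it to the samples you are looking at.
    """
    hits = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            is_bin = info.filename.lower().endswith(".bin")
            if is_bin and info.file_size >= min_size:
                hits.append((info.filename, info.file_size))
    # Largest first: big blobs are the most interesting candidates.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `large_binary_parts("[Original Document Path]")` returns the candidate parts, which can then be extracted with `zipfile` and inspected in a hex editor.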
OfficeMalScanner can be leveraged to find both shellcode and potential embedded files (e.g. OLE, PE but not Mach-O or ELF). As an alternative, pyxswf.py may be used to extract embedded Flash files from OLE structures. OfficeMalScanner is also capable of bruteforcing simple encodings such as XOR, ADD and ROR to detect embedded files that are encoded. Assuming you can spot the shellcode within the file, or OfficeMalScanner provided one or more offsets, jmp2it allows you to “jump” to the shellcode. jmp2it loads the file in memory before passing execution to its stub, which can be useful when the shellcode expects the file to be somewhere in memory. The other tools I referred to in the previous post are also usable (e.g. oledump).
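To illustrate the idea behind bruteforcing a simple encoding, here is a rough sketch of a single-byte XOR scan. It is not how OfficeMalScanner is implemented; the signature choices (the OLE2 header and the DOS stub message of PE files) and the function name are my own assumptions.

```python
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file header
SIGNATURES = {
    "OLE": OLE_MAGIC,
    "PE": b"This program cannot be run in DOS mode",
}

def xor_scan(data, signatures=SIGNATURES):
    """Try every single-byte XOR key and return (key, name, offset) hits."""
    hits = []
    for key in range(1, 256):  # key 0 would be the unencoded data itself
        # Translation table that XORs every possible byte value with `key`.
        table = bytes(value ^ key for value in range(256))
        decoded = data.translate(table)
        for name, magic in signatures.items():
            offset = decoded.find(magic)
            while offset != -1:
                hits.append((key, name, offset))
                offset = decoded.find(magic, offset + 1)
    return hits
```

Each hit gives the key and the offset at which the decoded signature starts, so the region can be decoded and carved for further analysis. ADD and ROR variants follow the same pattern with a different translation table.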
As I mentioned, with Office exploits the shellcode is typically embedded within the file. scDbg can be useful in such cases, assuming the shellcode simply resolves dependencies and downloads malware. However, for cases where files are dropped, scDbg does not seem to cut it even though it finds valid execution paths.
I will now analyse three documents with distinct degrees of difficulty. At the end of the post I will provide some guidelines for you to start doing your own analysis.
This sample exploits CVE-2010-3970, a buffer overflow vulnerability in the Windows graphics rendering engine. We are interested in the shellcode so we start with scDbg on the original document:
Well, that was pretty simple, right? Next!
This document exploits CVE-2009-0563, a vulnerability that affects Microsoft Office for Mac. If you try to understand what the shellcode does using scDbg, you will not have much luck. You will be given four offsets but all executions lead to nothing. OfficeMalScanner dumps an OLE file and provides a couple of potential offsets for the shellcode:
Now, for the shellcode:
The top image applies to the offsets 0x1719b, 0x172d3, 0x172dd and 0x172e7. I have no idea what 0x8FE01000 means but I assume it somehow leads to the offset 0xc20:
Where 0xD7C is the dropper function:
The shellcode relies on system calls to do the job. System calls on OS X have somewhat funny semantics: the arguments are pushed onto the stack, but there is always a dummy push before the system call. IDA assumed the binary was for Linux and named the system calls wrongly. I have corrected that using this. Basically, three files are created and written:
- /tmp/launch-hs: script to run the files below
- /tmp/launch-hse: malware
- /tmp/file.doc: fake document to trick the victim into thinking that nothing happened
We now know there is malware embedded. We also know this document affects Mac OS so it is likely that the embedded malware is a Mach-O file. If we take a look at the document and we look for Mach-O headers (e.g. FE ED FA CE and FE ED FA CF) we can see:
Now we need to extract the binary. We know where the first character for the embedded OLE document is:
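A good approach would be dumping whatever is between 0x7632 and 0x1e685. Let us do it (not explaining the math here :)). The hex values are wrapped in $((...)) because GNU dd reads the x in 0x7632 as a multiplication sign:
dd if=[Original Document Path] bs=1 skip=$((0x7632)) count=$((0x17054)) of=[Malware Path]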
You should obtain a file with the md5 d4c021c2af0af225287581d43aab4008. Now we have to take a look at the Mach-O headers (use the command otool -l [Malware Path]):
While for PE files there is the concept of sections, for ELF and Mach-O files there are segments and sections. Segments are associated with runtime while sections are for linking purposes. In any case, roughly speaking, a segment is like a PE section. We can, therefore, work out the size of the Mach-O file by looking at the physical offset of the last segment and its size. In this case, the offset is 32768 and the size is 11720, so the file is 32768 + 11720 = 44488 bytes long. If you use the previous dd command but replace count with that sum you will get a file with the md5 89c35c057655e67580efd0ff8242d960 (VirusTotal URL).
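The same reasoning can be scripted instead of reading the otool output by hand. The sketch below is a simplified take under a few assumptions: it only handles thin (non-fat) Mach-O files, the function names are mine, and it trusts the header fields without validation. It looks for the magic bytes mentioned above and returns the largest "file offset + file size" over all segment load commands.

```python
import struct

MH_MAGIC, MH_MAGIC_64 = 0xFEEDFACE, 0xFEEDFACF
LC_SEGMENT, LC_SEGMENT_64 = 0x1, 0x19

def find_macho(blob):
    """Return sorted offsets of the byte sequences FE ED FA CE / FE ED FA CF."""
    offsets = []
    for magic in (b"\xfe\xed\xfa\xce", b"\xfe\xed\xfa\xcf"):
        pos = blob.find(magic)
        while pos != -1:
            offsets.append(pos)
            pos = blob.find(magic, pos + 1)
    return sorted(offsets)

def macho_file_size(blob, start=0):
    """Return max(fileoff + filesize) over the segments of the Mach-O at `start`."""
    magic_be = struct.unpack_from(">I", blob, start)[0]
    magic_le = struct.unpack_from("<I", blob, start)[0]
    if magic_be in (MH_MAGIC, MH_MAGIC_64):
        endian, magic = ">", magic_be   # FE ED FA CE/CF on disk: big-endian
    elif magic_le in (MH_MAGIC, MH_MAGIC_64):
        endian, magic = "<", magic_le   # CE/CF FA ED FE on disk: little-endian
    else:
        raise ValueError("no Mach-O magic at offset %#x" % start)
    is64 = magic == MH_MAGIC_64
    # mach_header: magic, cputype, cpusubtype, filetype, ncmds, sizeofcmds,
    # flags (plus a reserved field on 64-bit), so ncmds sits at offset 16.
    ncmds = struct.unpack_from(endian + "I", blob, start + 16)[0]
    cursor = start + (32 if is64 else 28)
    end = 0
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from(endian + "II", blob, cursor)
        if cmd == LC_SEGMENT:
            # cmd, cmdsize, segname[16], vmaddr, vmsize, then fileoff, filesize
            fileoff, filesize = struct.unpack_from(endian + "II", blob, cursor + 32)
            end = max(end, fileoff + filesize)
        elif cmd == LC_SEGMENT_64:
            # Same layout with 64-bit vmaddr/vmsize/fileoff/filesize
            fileoff, filesize = struct.unpack_from(endian + "QQ", blob, cursor + 40)
            end = max(end, fileoff + filesize)
        cursor += cmdsize
    return end
```

With the document loaded as `blob`, `blob[start:start + macho_file_size(blob, start)]` carves the binary for a given `start` returned by `find_macho`. Data appended after the last segment, if any, would be missed by this approach.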
This document is an Excel file that exploits CVE-2011-0609, a vulnerability in Adobe Flash. According to OSINT, this document was leveraged to attack RSA years ago. Using OfficeMalScanner for preliminary analysis:
we can obtain the embedded SWF file (md5 8ea327df5270b73bcdaa8c893f0e92b7). I have not yet covered the analysis of Flash files on this blog, even though I plan to. In any case, we can obtain the disassembled Flash file by running:
swfdump -a [SWF FILE PATH] > [DISASSEMBLED SWF FILE PATH]
Looking through the code we can easily spot two massive hex strings:
You can convert the hex strings to binary data using the following command:
xxd -r -p [SHELLCODE HEX STRING FILE PATH] > [BINARY DATA FILE PATH]
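If you prefer not to copy the strings out by hand, something along these lines can pull long hex runs straight from the disassembly and decode them. This is only a sketch: it assumes the hex strings appear as contiguous runs in the swfdump output, and the 64-byte minimum length and the function name are arbitrary choices of mine.

```python
import re

# 64 or more hex-encoded bytes in a row; the threshold is an assumption.
HEX_RUN = re.compile(r"(?:[0-9A-Fa-f]{2}){64,}")

def extract_hex_blobs(text):
    """Return the decoded bytes of every long hex run found in `text`."""
    return [bytes.fromhex(match.group(0)) for match in HEX_RUN.finditer(text)]
```

Each returned blob can be written to its own file. A blob that starts with the bytes `CWS` is a compressed Flash file.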
The first hex string contains the shellcode while the second contains a CWS file. The CWS file is crafted in a way to exploit the vulnerability when loaded a few lines below:
If you try to run the shellcode using scDbg you will get stuck because the shellcode will look for file handles. We cannot use jmp2it because the shellcode is generated by the Flash file so we have no offset to pass as argument. We could write a simple program to open the file and get a handle for it. We could then load the shellcode within the process space but we don’t know what the shellcode is looking for (e.g. it may search for an offset within the crafted CWS which is only available if the original document is opened). We can start by running the shellcode and modifying its flow on the fly as needed:
Nothing unusual here: dependency resolution and then a “bruteforced search” for the handle of the Excel file. Since the search for the file handle will fail (the file is not loaded in memory), the ReadFile call will also fail, so you need to ignore that part. We then see a pattern similar to something we have seen before:
Basically, the shellcode is looking for two markers: C.BG and FUcK. It is likely looking for something (e.g. malware). Set EIP to the offset of “inc eax” and follow the instructions above. We now resume execution as shown below:
The dropped file was PoisonIvy (md5 b15b4a89bfdb0e279647ab6c14e80258, VirusTotal URL).
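Before stepping through the shellcode, it can help to check where the two values it searches for (C.BG and FUcK) occur in the original document. A minimal sketch follows; it assumes the markers are stored verbatim in the file on disk, which may not hold if they are encoded, and the function name is hypothetical.

```python
def marker_offsets(data, markers=(b"C.BG", b"FUcK")):
    """Map each marker to the list of offsets where it occurs in `data`."""
    found = {}
    for marker in markers:
        offsets = []
        pos = data.find(marker)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(marker, pos + 1)
        found[marker] = offsets
    return found
```

The offsets give you a region of the document worth inspecting while following the shellcode in the debugger.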
As you can see, just as with PDF analysis, the hard part is always the analysis of the shellcode. Assuming there are no components embedded within the documents that generate shellcode (e.g. Flash content), the shellcode will be within the document structure. scDbg can be of great use if the shellcode downloads and deploys malware. However, for droppers that require the malicious document to be in memory, it is best to use OfficeMalScanner or even scDbg to determine the beginning of the shellcode and then use jmp2it to support its manual debugging.
For dynamically generated shellcode (e.g. by Flash), it is best to analyse the generator and extract the shellcode. The analysis proceeds with a typical debugger. While shellcode structure may vary, the idea tends to be the same:
- Resolve dependencies or use hardcoded addresses (e.g. for non ASLR systems)
- Download or drop (after optional decoding) malware to be then executed. The downloaded malware may also be encoded.
Droppers tend to be harder to analyse, especially when ROP garbage and encoded payloads are involved. However, the trick tends to be:
- Locating decoding loops (e.g. xor, mov, loop and variants)
- Understanding which register(s) is (are) being initialised with the memory offset for the document
- Manually loading the document/encoded malware in memory and adjusting the register(s) accordingly
- Setting EIP manually and skipping instructions that may cause memory violations
For shellcode that contains egg hunters (i.e. instructions to search for more shellcode in memory) the process is very similar.
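Once a decoding loop has been read off the disassembly, it is often quicker to reimplement it offline than to single-step it. The sketch below shows two building blocks under stated assumptions: the loop shapes (an XOR with a key that is incremented after every byte, and an 8-bit rotate right) are hypothetical examples, not taken from any of the samples above.

```python
def rolling_xor(data, key, step=0):
    """XOR each byte with `key`, adding `step` to the key (mod 256) per byte.

    With step=0 this is a plain single-byte XOR. The operation is its own
    inverse, so the same call both encodes and decodes.
    """
    out = bytearray()
    for byte in data:
        out.append(byte ^ key)
        key = (key + step) & 0xFF
    return bytes(out)

def ror8(value, count):
    """Rotate an 8-bit value right by `count` bits."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF
```

For instance, `bytes(ror8(b, 3) for b in encoded)` undoes a per-byte rotate left by 3, and `rolling_xor(encoded, key, step)` undoes the rolling XOR given the initial key and step observed in the debugger.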
Most of the tips I provided for exploit analysis in (Not) All She Wrote (Part 1): Rigged PDFs also apply to Office documents (e.g. emulators, plain debugging, OSINT), so please refer to that post for more information.
In the next and last post(s) of this series I will analyse RTF files. Until then…
Stay Safe 😉