It seems we have reached the final post. Previously, I covered PDFs containing exploits and Office documents containing macros and exploits. This post will be lighter than the others since I won’t be doing a full analysis of the documents. I have already shown how to analyse embedded shellcode once you have it, so I won’t repeat the topic.
Malicious RTF files tend to exploit vulnerabilities in software such as Word, so you may wonder why I am dedicating a post to a case that could fit into one of the previous posts. Well, for starters, there are extra tools, and RTF files have a structure that resembles PDFs. Also, when I was searching for specimens to analyse, I realised that most of the exploits against Office tend to leverage RTF files, so I wanted to dedicate a post to them.
This post will focus on tools and the analysis approach. I will start as usual with an overview of the RTF file structure so you understand where the good stuff may be.
RTF Files 101
Say you want to have a file with the string “SecurityOverSimplicity”. The simplest way to represent a string is by converting its characters to a sequence of bytes using an encoding (e.g. ASCII, Unicode). As an example, create a file using a simple text editor (e.g. Sublime) and write the string “SecurityOverSimplicity”. Then, use:
hexdump -C SecurityOverSimplicity.txt
As you can see, the string is stored as a sequence of ASCII-encoded bytes. If you check the file with the file command you will get “SecurityOverSimplicity.txt: ASCII text, with no line terminators”. Simplicity is bliss but you are ambitious. You want different fonts. You want bold text and some flashy images. By now you should see that, in order to get something fancy, you need to add a bit more information. That is where control words and groups come into play (in RTF files):
Basically, the above document will contain the string SecurityOverSimplicity in the Times New Roman font. fs60 dictates that the string should have a size of 30pt (the parameter is expressed in half-points, hence half of 60). In more sophisticated documents, you can have control words such as \pict for pictures or \objdata for embedded objects, among others. As an example, for an embedded image, the following picture shows what the RTF file would look like (truncated):
As you can see, the picture is encoded into a sequence of hexadecimal characters. Fortunately, there are tools to decode this type of data. Why is this relevant? Where do exploits come into play? First, the targets of exploitation tend to be parameters for control words. Second, while there is a standard for RTF, applications like Word tend to be extremely flexible when it comes to parsing and processing (e.g. extra data within groups can be ignored by Word instead of causing a parsing error).
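To make control words and hex-encoded data a bit more concrete, here is a minimal Python sketch. The sample document, the regular expressions and the function names are assumptions made for illustration (they are not taken from any of the tools mentioned in this post), and the parsing is deliberately naive: it ignores nested groups, escaped characters and raw binary data.

```python
import re
from typing import Optional

# Hypothetical sample: a tiny RTF document with one font, a 30pt string and
# a fake, truncated picture group whose payload is hex-encoded.
SAMPLE_RTF = (
    r"{\rtf1\ansi{\fonttbl{\f0 Times New Roman;}}"
    r"\f0\fs60 SecurityOverSimplicity"
    r"{\pict\pngblip 89504e470d0a1a0a}}"
)

# A control word is a backslash, some letters and an optional numeric parameter.
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)?")


def control_words(rtf: str) -> list[tuple[str, Optional[int]]]:
    """Return (name, parameter) pairs in order of appearance."""
    return [
        (name, int(param) if param else None)
        for name, param in CONTROL_WORD.findall(rtf)
    ]


def decode_pict_hex(rtf: str) -> bytes:
    """Decode the hex payload of the first \\pict group (naive, no nesting)."""
    match = re.search(r"\{\\pict[^{}]*?\s([0-9a-fA-F\s]+)\}", rtf)
    if match is None:
        return b""
    hex_text = re.sub(r"\s+", "", match.group(1))  # hex may span several lines
    return bytes.fromhex(hex_text)


if __name__ == "__main__":
    print(control_words(SAMPLE_RTF))
    print(decode_pict_hex(SAMPLE_RTF))
```

With this sample, control_words() yields pairs such as ("fs", 60), and decode_pict_hex() returns the eight bytes of a PNG signature. Real malicious documents are far messier, which is exactly why the dedicated tools exist.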
It is my understanding that such flexibility explains the huge number of exploits delivered through RTF files. I have found these files to be slightly harder to analyse when compared to Office documents. While with Office documents a linear scan using scDbg or OfficeMalScanner does the trick most of the time, with RTF files you are required to know a bit more about the exploited component (e.g. the affected control word) to find and extract the shellcode.
Flexibility has also caused tools such as the ones I will refer to next to fail in the past when analysing files that did not conform to the main specification. For a discussion on standards vs. reality and processing issues, please refer to Decalage or this Sophos paper.
The Tools and the Approach
As far as I am aware, the following tools can be used to extract information from RTF documents or documents leveraging the OLE technology:
- rtfobj.py: for simple object extraction
- pyxswf.py: may be used to extract embedded Flash files
- rtfdump.py: rtfobj.py on steroids. Allows you to dump the hierarchies of groups, extract encoded data in different formats (e.g. hexdump, binary), cut data, scan with YARA rules, etc.
- OfficeMalScanner’s RTFScan: similar to OfficeMalScanner, covered in previous posts, but for RTF files. It is able to extract embedded objects and find shellcode.
In terms of the analysis, the approach tends to be running RTFScan to dump any embedded files and find shellcode. The dumped files, if they are OLECF, can be scanned again with OfficeMalScanner. If nothing comes up, you have to understand which component is being exploited within the RTF file (e.g. listoverridecount for CVE-2012-2539) and try to extract the data associated with it. The shellcode may be stored within a large binary file inside an embedded OLECF to increase the chances of execution, or somewhere within a large encoded sequence of bytes after some RTF control word.
These are just tips, not rules!
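As a small aid for the first tip, deciding whether a dumped file is OLECF before handing it to OfficeMalScanner can be done by checking its first eight bytes. The sketch below is an assumption-laden helper, not part of RTFScan: the function names and the idea of pointing it at a dump directory are hypothetical.

```python
from pathlib import Path

# Signature that opens an OLE Compound File (OLECF) container.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def is_olecf(data: bytes) -> bool:
    """True if the buffer starts with the OLECF signature."""
    return data.startswith(OLE_MAGIC)


def olecf_files(directory: str) -> list[Path]:
    """Return the files in `directory` that look like OLECF containers."""
    hits = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                # Only the header is needed, so avoid reading whole files.
                if is_olecf(handle.read(len(OLE_MAGIC))):
                    hits.append(path)
    return hits
```

A signature match only says the file claims to be OLECF; a malformed container can still trip up the tools that parse it.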
As previously mentioned, unless I find a reason to do so, no shellcode will be analysed. The purpose of these examples is to explain to you what you should do in general and how to deal with tricky cases.
This file exploits CVE-2017-0199 which is a sad feature similar to DDE called OLE2Link. It basically allows an attacker to craft a document that, when opened, will fetch and execute an HTA file from a remote host. This exploit requires a LinkObject embedded within the document. Some reckon:
There are only two embedded objects in this case. The object with id 148 per rtfdump.py is what you are looking for:
In this case I have specified the encoding for the strings to be Unicode. Basically, new.hta will be downloaded and executed if you open the document and click OK on the warning you will get.
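If you dump the object to a file and want to look for the link yourself, the following Python sketch pulls out printable strings stored as UTF-16LE, which is what "Unicode" usually means on Windows. It is only an approximation of what a strings-style tool does; the function name, the minimum length and the URL in the usage example are made up for illustration.

```python
import re


def utf16le_strings(data: bytes, min_chars: int = 6) -> list[str]:
    """Return printable ASCII strings stored as UTF-16LE (char followed by 0x00)."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [match.group().decode("utf-16-le") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Hypothetical blob: junk bytes around a made-up link to new.hta.
    blob = b"\x01\x02" + "http://example.invalid/new.hta".encode("utf-16-le") + b"\xff\xff"
    print(utf16le_strings(blob))
```

Anything that looks like a URL or a UNC path in the output is a good candidate for the remote resource the document tries to fetch.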
This is a file that leverages CVE-2010-3333, a stack overflow exploitable through the control word pFragments. It is also a corner case where automated extraction using RTFScan fails. There is basically a large string embedded within the sv control word:
It follows that the shellcode is somewhere within that string. According to OSINT, the shellcode starts after the bytes acc8 (0xc8ac when read as a little-endian word). Regardless, once you locate and dump the content of that control word argument as such (please refer to Didier Stevens’ blog for instructions regarding rtfdump.py flags):
scDbg is able to locate the multiple potential entrypoints for the shellcode:
scDbg should be run against the bytes extracted using rtfdump.py and not against the hex string itself.
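If you end up with the hex string rather than the decoded bytes, converting it yourself is straightforward. The Python sketch below is a minimal illustration under the assumption that the argument contains only hex digits and whitespace; the function names are hypothetical, and the marker is simply the acc8 sequence mentioned above.

```python
import re


def hex_text_to_bytes(text: str) -> bytes:
    """Convert the hex characters of a control word argument into raw bytes."""
    digits = re.sub(r"\s+", "", text)  # the hex text may be split across lines
    if len(digits) % 2:
        digits = digits[:-1]  # ignore a dangling half byte at the end
    return bytes.fromhex(digits)  # raises ValueError on non-hex characters


def offset_after_marker(data: bytes, marker: bytes) -> int:
    """Return the offset just past the first occurrence of `marker`, or -1."""
    index = data.find(marker)
    return -1 if index < 0 else index + len(marker)


if __name__ == "__main__":
    raw = hex_text_to_bytes("4141 acc8\n9090cc")
    print(offset_after_marker(raw, bytes.fromhex("acc8")))
```

The resulting bytes can be written to a file and handed to scDbg, and the offset gives a first candidate for where to start looking.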
This document exploits CVE-2015-1641, a vulnerability that leverages memory corruption for remote code execution. Below is the output of the multiple tools I have previously mentioned:
As you can see, RTFScan and rtfdump.py report four embedded objects while rtfobj.py reports only three. The fourth is datastore which, while being an embedded object, appears on many legitimate RTF files. Regardless, you are left with four OLECF files with the following MD5s:
- File1: 06b951cce02b2cd5864161fc16660284
- File2: 8ad872d6cb78fc9a0cdabdae32d1dd72
- File3: a0ee8aa8c12e0d8842dbe007e7825c76
- File4: c13c07027f7d219b61436222bb51e09c
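Should you want to check your own extracted files against these hashes, a short Python sketch using the standard hashlib module is enough. The function names are arbitrary and the file is read in chunks only as a habit for large samples.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the MD5 of an in-memory buffer as a hex string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```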
You can inspect the files using any of the tools I have referred to in the previous posts about Office documents. I have also discovered that 7-Zip is able to open OLECF files. You will notice that only File2 and File3 actually contain something (the others seem empty). What distinguishes the two OLE files is the existence of an activeX folder in File2 with two .bin files which happen to be OLE files as well (notice the size!):
If you scan those bin files with OfficeMalScanner it will report that activeX1.bin contains a shellcode sequence (as shown in the picture above). Let us put that theory to the test:
Looks like shellcode to me. You may wonder why I did not run OfficeMalScanner against the embedded OLECF files at the very beginning. I did and it did not find the shellcode.
Regarding the posts about exploit analysis, the following disclaimer should be obvious:
The posts I have created about exploit analysis are not for unknown/0-day vulnerabilities. I have also paid little to no attention to the source and the mechanics of the exploits (unless strictly necessary). Those posts were meant to extract shellcode and understand its functionality fast.
As you may have noticed, RTF files require extra effort since the shellcode may be encoded within control word arguments, embedded OLECF files or some other embedded resource. However, the following heuristics, together with the ones provided for Office documents, should get you started:
- Scan RTF documents with RTFScan.exe. The dumped OLECF files (if any) should be scanned using OfficeMalScanner.exe.
- If OfficeMalScanner indicates some offsets for shellcode (within the aforementioned OLECF files), leverage scDbg to get an automated analysis. Otherwise, leverage OfficeMalScanner and scDbg to obtain potential offsets. Analyse manually as explained in previous posts.
- If at this point you still have no shellcode offsets:
- Inspect the OLECF files (if any) using tools like 7-Zip or the ones referred to in my previous posts. See if any large file stands out and run it through scDbg.
- Look for NOP sleds (e.g. 90h sequences) within the main document or files extracted from it.
- Still no shellcode? Investigate the vulnerability a bit more to get some context. Answer the following questions: Where is the vulnerability triggered? Is it after a control word or an embedded object? Dump the arguments for the control word and proceed from there.
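The NOP sled heuristic from the list above is easy to automate. The following Python sketch reports runs of 0x90 bytes in a binary buffer; the function name and the default threshold of 16 bytes are arbitrary assumptions, and a hit is only a hint, not proof of shellcode.

```python
import re


def find_nop_sleds(data: bytes, min_length: int = 16) -> list[tuple[int, int]]:
    """Return (offset, length) for each run of 0x90 bytes of at least `min_length`."""
    pattern = re.compile(rb"\x90{%d,}" % min_length)
    return [(match.start(), len(match.group())) for match in pattern.finditer(data)]


if __name__ == "__main__":
    sample = b"\x00" * 10 + b"\x90" * 20 + b"\xcc" + b"\x90" * 4
    print(find_nop_sleds(sample))
```

Keep in mind that inside the RTF itself the sled is usually hex-encoded text (a long run of the characters 9 and 0), so decode the data to bytes before scanning it this way.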
What if none of these work? Bear in mind that I am not reproducing the vulnerable environment, so it is likely that some CVEs won’t be analysable without actually triggering the exploit. If you have the exact vulnerable environment (e.g. a VM with a vulnerable Office version) you can open the document, stop the VM and then analyse the memory using Volatility and a plugin like malfind. You can also attach a debugger to the vulnerable software and observe the exploit as it progresses.
Stay safe 😉