You know that moment when you get an email with an attached document and a promise of good fortune once you open it? Great, the following set of posts is for you. Lately I have been trying to learn how to speed up the analysis of malicious documents (e.g. PDF, Office, RTF). Because documents are so flexible, a thorough explanation of all their tags and mechanisms would take multiple posts, so I will keep the 101s simple. Also, this article focuses on analysing PDFs that exploit CVEs. There are multiple ways to deliver malice using PDFs:
- URLs for malicious domains (e.g. phishing)
- Unsafe by design (e.g. launching cmd.exe by clicking somewhere in the document)
PDF documents are essentially a bunch of objects and tags delimited by %PDF-[Specification Number] and %%EOF. Those tags typically live inside objects so, if you open a PDF in a text editor, you could see something like:
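To get a feel for that structure without any tooling, you can scan the raw bytes yourself. The following is a minimal Python sketch and not how pdfid or pdf-parser are implemented. The list of names to count is my own illustrative selection, SAMPLE is a made-up two-object document, and names obfuscated with #xx hex escapes are not handled.

```python
import re

# Names that commonly deserve a closer look (illustrative, not exhaustive).
SUSPICIOUS_NAMES = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
                    b"/Launch", b"/EmbeddedFile"]


def summarize_pdf(data: bytes) -> dict:
    """Return a rough structural summary of raw PDF bytes."""
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    summary = {
        "version": header.group(1).decode() if header else None,
        "has_eof": b"%%EOF" in data,
        # "N G obj" opens an indirect object, e.g. "1 0 obj"
        "objects": len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data)),
    }
    for name in SUSPICIOUS_NAMES:
        # Reject matches followed by another name character, so a longer
        # name that merely shares the prefix is not counted.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        summary[name.decode()] = len(re.findall(pattern, data))
    return summary


# Hypothetical minimal document used only to exercise the function.
SAMPLE = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /S /JavaScript /JS (app.alert\\(1\\);) >>\nendobj\n"
          b"%%EOF\n")

for key, value in summarize_pdf(SAMPLE).items():
    print(key, value)
```

A document that combines /OpenAction with /JS or /JavaScript is worth a closer look, since that pairing means script runs when the file is opened.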
The Tools and the Approach
If you read across the internet and some books (e.g. Malware Analyst's Cookbook), the procedure to analyse these documents seems trivial and goes along the following lines:
- Decode streams using pdf-parser.py, pdf.py (part of jsunpack-n), pdftk or pdfextract.
- If the document is rigged to exploit a CVE, extract the shellcode and triage it using sctest/scDbg or turn the shellcode into a PE file using shellcode2exe. Leverage a debugger for the latter. You can extract the shellcode manually or, if possible, use jsunpack-n.
Steps 1-3 take some time but are straightforward. The same applies to step 5. However, step 4 is a pain due to the following factors:
- I have found that complex constructions such as:
lead to errors where variables are not known. In the example above, when I used SpiderMonkey, with or without D. Stevens' extensions, I got a persistent error indicating that s is not known within the context of yo.
In conclusion, we have more than enough tools to parse PDFs and analyse them statically, but existing analysis guides tend to be pretty optimistic and skip lots of inconvenient cases. While I can’t describe every case, the analysis of the documents below, together with some notes and tips, should be enough for you to start analysing malicious PDF documents. The cumbersome part tends to be the shellcode, for reasons discussed at the end of this post. I will be using REMnux for the analysis so some commands are shortened (e.g. python pdfid.py -> pdfid).
If you open the document within an isolated environment you will notice that it contains a bunch of links for what seem to be legitimate domains. We need to look deeper.
First, we enumerate:
or, as I prefer, you can use:
pdfextract -j [PDF File Name] -o [Output Dir]
Copy the whole script, strip out the Adobe tags (e.g. template xmlns…), use js-beautify to indent the code and push all the functions to the top, leaving this sequence at the bottom:
One common trait of obfuscated scripts is the existence of a function responsible for decoding internal strings. The pattern is typically the same: calls to a function with a random set of characters as one of the arguments. In this case the offender is fW1X662. Now we extract all the calls, wrap each one in print and put the result in a JS file together with the functions we pushed to the top. This should do part of the trick:
cat [JS File] | grep -E -o "fW1X662[ ]*\('[^)]+\)" | while read -r line; do echo "print(\"$line->\" + $line );"; done
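If you would rather not fight shell quoting, the same wrapping step can be sketched in Python. This is only an illustration that reuses the pattern from the command above. The calls in EXAMPLE_JS are invented, and skipping repeated calls is my own addition to keep the output short.

```python
import re


def build_print_wrappers(js_source: str, decoder: str = "fW1X662") -> list:
    """Wrap every decoder call found in js_source in a print() statement."""
    # Same idea as the grep pattern: decoder name, optional spaces, "('",
    # then everything up to the first ")".
    pattern = re.compile(re.escape(decoder) + r"[ ]*\('[^)]+\)")
    wrappers = []
    seen = set()
    for call in pattern.findall(js_source):
        if call in seen:  # skip repeated calls
            continue
        seen.add(call)
        wrappers.append('print("%s->" + %s);' % (call, call))
    return wrappers


# Invented calls, only to show the shape of the output.
EXAMPLE_JS = ("var a = fW1X662('Zx9q', 3); var b = fW1X662 ('k0Lm'); "
              "var c = fW1X662('Zx9q', 3);")

for line in build_print_wrappers(EXAMPLE_JS):
    print(line)
```

Like the shell version, this breaks if an argument contains a closing parenthesis or a double quote, so check the generated lines before running them.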
You should have something like:
Notice the variables at the top. When you selectively execute code it is likely that the interpreter will complain about missing variables. An interesting variable here is xfa. If you try to execute the file we have built without it you will get an error: ReferenceError: xfa is not defined. This is where running PDF code in a plain JS engine fails. xfa is an object provided by Adobe to help with form interaction. If you take a closer look at its usage within the code:
you see that this is likely a mechanism to detect an environment that is not a PDF engine. It just checks whether xfa is present. Simply add xfa = true;. Execute the file with SpiderMonkey (i.e. js) or Windows cscript/wscript, get the results and replace the calls in the original file with the decoded strings. Typically this would make the script simpler but in this case not really 🙂
Moving forward, we know for sure how relevant the software version check is and on Adobe, you use the app object for that purpose. Check the following:
As you can see, a massive if block is executed if the reader version is at or above 220.127.116.11 and then you have blocks for version < 18.104.22.168 and <22.214.171.124. Now that we know the cases, we want some shellcode. We have to trick the code into thinking it is executing within a vulnerable environment. Furthermore, we are interested in the variable nee (what does it hold at the end?). Changes:
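One way to apply this kind of change without editing the script by hand is to generate the stubs and prepend them. The Python sketch below is only an illustration. It assumes the script reads the version from app.viewerVersion (a property Acrobat JavaScript does expose, though this sample may read something else) and that print is available in your JS shell. build_prelude, build_epilogue and instrument are hypothetical helper names.

```python
def build_prelude(viewer_version: float) -> str:
    """Return JavaScript stubs for objects that only exist inside a reader."""
    return "\n".join([
        "// Stubs so the script believes it runs inside a PDF reader",
        "var xfa = true;",
        # Assumption: the version check goes through app.viewerVersion.
        "var app = { viewerVersion: %s };" % viewer_version,
    ]) + "\n"


def build_epilogue(variable: str = "nee") -> str:
    """Return JavaScript that reports how much data ended up in a variable."""
    return 'print("%s length: " + %s.length);\n' % (variable, variable)


def instrument(script: str, viewer_version: float) -> str:
    """Prepend the stubs and append the report to the extracted script."""
    return build_prelude(viewer_version) + script + "\n" + build_epilogue()


# Tiny stand-in for the real extracted script.
print(instrument("var nee = '';", 8.2))
```

Generating one instrumented file per version makes it easy to compare which branch builds a payload.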
Test with 8.2 and then with the other one. Once you run this again, you will get an error ace is not defined (search for ace[cz] = yj;). This is one of the last lines of f0G32YR. ace is an object associated with Adobe color engine. For some exploits there are at least two phases:
- Filling memory with NOP sleds and shellcode
The ace part triggers the vulnerability while nee holds the shellcode. Comment out ace. This sample is tricky since there is no visible variable holding a large chunk of shellcode, as there is in some other samples. In this case multiple function calls are performed in a loop to generate a massive amount of shellcode:
This loop will fill nee with multiple chunks of repeated shellcode to increase the likelihood of the execution falling within shellcode. We cannot dump this shellcode at the end since it is massive and contains non-important stuff. Shellcode is typically managed by scripts in escaped Unicode (e.g. 0x9090 becomes %u9090). The shellcode is “unescaped” using the unescape function. Searching for it yields:
The print was added by me. You will get two prints: the first is the shellcode and the second a tiny nop-sled. Copy the big chunk of shellcode to a file.
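If you do not have a converter at hand, turning the escaped text into raw bytes takes only a few lines. This is a minimal Python sketch that handles only %uXXXX sequences and ignores everything else in the input. Each %uXXXX is a 16-bit unit, and on x86 its low byte comes first in memory. The function name is hypothetical.

```python
import re


def percent_u_to_bytes(text: str) -> bytes:
    """Convert %uXXXX sequences to raw bytes, low byte first."""
    out = bytearray()
    for hex_word in re.findall(r"%u([0-9a-fA-F]{4})", text):
        out += int(hex_word, 16).to_bytes(2, "little")
    return bytes(out)


# Two NOPs followed by xor eax, eax (33 C0)
print(percent_u_to_bytes("%u9090%uc033").hex())  # -> 909033c0
```

Write the result to disk in binary mode (open(path, "wb")) so that no newline or encoding translation touches the bytes.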
Now we can process/analyse the shellcode in one of two ways:
- Static: ConvertShellcode or rasm can be used to disassemble shellcode provided as a string. Shellcode tends to resolve dependencies at runtime, so plain shellcode may be tricky to analyse. The absence of plain strings (there are decoding routines) makes this sample even harder to analyse with this approach.
- Dynamic: shellcode2exe.py + debugger, libemu sctest (shellcode emulator) or scDbg. The latter tends to work better since some shellcode has ROP garbage which is “ignored” by scDbg but not by sctest. If you want a quick answer, emulators are your best bet. You can also leverage the Olly Advanced plugin to load raw bytes of shellcode from a file into a memory region of “your choice”. You then have to set EIP to the beginning of the shellcode in memory and step through from there.
Also, REMnux has a couple of converters (e.g. unicode2raw, unicode2hex-escaped). Since the output of the previous script was unicode, we can do:
# Flags:
#   -S -> read from stdin
#   -s -> number of steps
#   -v -> verbose
cat [script output] | unicode2raw | sctest -Svs 1000000000
However, you see the output of sctest as:
While the outputted warnings/strings are normal, the fact that only one step took place is indicative that something failed. Using scDbg:
Assuming both emulators fail, you need to debug the code. You can load the shellcode bytes on the debugger or convert raw shellcode to an executable:
python shellcode2exe.py [Raw Shellcode]
Load the resulting exe in OllyDbg or WinDbg. You should see this as the entrypoint:
After multiple “Access Violations”, you will realise that the entrypoint is here. The remaining code may be designed to work in an ideal exploitation environment (e.g. within Acrobat Engine). All API resolutions use API hashing. You should observe the following chain of events:
- Shellcode and some string decoding (e.g. URL http[:]//ladh[.]cz[.]cc/w/l.php?x=1001)
- urlmon load and resolution of URLDownloadToCacheFileA. Leverages the latter to access the URL and download malware
- Resolves CreateFileA and opens the cached file stored on the path provided by URLDownloadToCacheFileA
- Resolves and uses memory mapping functions (e.g. CreateFileMappingA, MapViewOfFile, UnmapViewOfFile) to map the downloaded malware in memory. FlushViewOfFile is used to flush any changes to the original file. The reason for this is not clear since, using a test binary, nothing changed. It is my understanding that URLDownloadToCacheFileA creates a temporary file and, by combining CreateFileA and memory mapping functions, the file is persisted (guessing here).
- Loads shell32, resolves ShellExecuteA and runs the downloaded file.
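Since the shellcode never stores API names in clear text, a small lookup table helps to map the constants you see in the debugger back to names. The sketch below uses the rotate-right-by-13 additive hash found in a lot of public shellcode. That choice is an assumption on my part, so confirm the rotation and any extra steps against the hashing loop in your sample before trusting the output. The candidate list simply reuses the APIs named in the chain above plus LoadLibraryA.

```python
def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def ror13_hash(name: bytes) -> int:
    """ROR-13 additive hash (assumed algorithm, verify against the sample)."""
    h = 0
    for byte in name:
        h = ror32(h, 13)
        h = (h + byte) & 0xFFFFFFFF
    return h


CANDIDATES = [b"LoadLibraryA", b"URLDownloadToCacheFileA", b"CreateFileA",
              b"CreateFileMappingA", b"MapViewOfFile", b"UnmapViewOfFile",
              b"FlushViewOfFile", b"ShellExecuteA"]

# hash constant -> API name
LOOKUP = {ror13_hash(name): name.decode() for name in CANDIDATES}


def resolve(constant: int) -> str:
    """Map a constant seen before a resolver call to an API name, if known."""
    return LOOKUP.get(constant, "unknown")


for constant, name in sorted(LOOKUP.items()):
    print("0x%08x %s" % (constant, name))
```

If none of the constants in the shellcode show up in the table, the sample most likely uses a different rotation or also mixes the module name into the hash.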
So nothing surprising here: just an exploit leveraged to download and execute malware. While a thorough analysis is welcome since it puts us in a position to tell which versions are affected, we can automate the shellcode extraction a bit. We will leverage jsunpack-n to extract the shellcode. jsunpack-n leverages Mozilla’s SpiderMonkey with some hooks (e.g. unescape, as I mentioned previously) to extract useful information such as eval’d strings, shellcode, etc. Let us see how it handles the PDF:
# Having the JS in a file.
#   -V -> very verbose
#   -f -> fast evaluation. I recommend leaving this out so the analysis is deeper.
jsunpackn.py [JS Path] [-f] -V
For the command you should get something like:
- >= 8 and < 126.96.36.199
- >=8 and <188.8.131.52
Also, it seemed like the affected element was the Adobe Color Engine (ACE). VirusTotal comments refer to Pidief and Pdfka. While searching for these I did not find any reference to ACE. However, searching for ACE vulnerabilities yields:
- CVE-2010-3622: Adobe Reader and Acrobat 9.x before 9.4, and 8.x before 8.2.5 on Windows and Mac OS X (memory corruption)
- CVE-2011-0598: Adobe Reader and Acrobat 10.x before 10.0.1, 9.x before 9.4.2, and 8.x before 8.2.6 on Windows and Mac OS X (integer overflow)
Problem1: What versions are exploited? What is the CVE?
Problem2: What does the shellcode do?
Approach: Try jsunpack-n and sctest/scDbg. sctest may fail if the shellcode contains instructions with ROP garbage. scDbg tends to be better since it attempts multiple execution paths and lets you choose which one to take. Worst case scenario, use shellcode2exe and debug the shellcode. Bear in mind that ROP code may hinder the analysis. Check the following section for an exotic example.
Problem3: What if it is not an exploit? What if it is just a document with some tricks exploiting design?
Approach: Approach 1 solves part of this. Leverage D. Stevens' tools and peepdf to perform some recon on the file. The commands provided in this post should suffice. Look for suspicious tags, understand relations between objects, look for embedded URLs, etc.
As a general note, especially for JS interpreters and automated deobfuscators, sometimes you may need to tweak the scripts a bit so they run. This has to be done on a case by case basis.
Shellcode garbage and entrypoint
While going through a couple of documents I realised that running the shellcode as extracted from the documents is no simple task. Shellcode used in exploits has added instructions that assume certain values in registers (e.g. addresses). Running the shellcode blindly will lead to access violations/crashes. In the case presented above, the core functionality started a couple of lines after the “initial entrypoint”. This is not always so. In some cases, the shellcode will search for tags (typically byte sequences) in the loaded document. This approach can be used to decode and deploy embedded malware or execute more shellcode. Take the example of 721601bdbec57cb103a9717eeef0bfca. This PDF exploits CVE-2010-1297, a vulnerability triggered through a crafted SWF file. If you extract the shellcode, you should have something like:
A proper jmp, call or decoding loop is nowhere to be found. Running strings on the shellcode shows a suspicious string: c:\-.exe. While wandering around the shellcode in IDA I spotted this:
Basically, the shellcode is looking for two markers, JucK and F.Zh. Whatever is between those markers is XORed with another value. The vertical arrows indicate normal execution. You should set EIP to the instruction at the beginning of those arrows and execute the code until the end (i.e. arrow tip). For the other arrow, you must execute the instructions until the beginning of it, jump (set EIP) to the instruction at its tip and proceed from there (0x401323->(after) 0x401336).
If you load the document somewhere in memory and follow the instructions above, you should be able to spot the following transformation (make sure you write down the value of ebx before and after the loop):
The decoding trick employed by the shellcode is pretty clever: the key stream starts at the low byte of the initial size of the section to be decoded (0xFC) and is decremented on each iteration. Each byte of the malware is XORed against the corresponding byte of that stream.
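Once you understand the loop, you can reproduce it offline instead of stepping through it. The Python sketch below follows my reading of the routine: locate the two markers, then XOR each byte with the low byte of a counter that is decremented on every iteration. Whether the markers themselves are part of the decoded range, and the exact starting value of the counter, are assumptions you should confirm in the debugger.

```python
def find_region(blob: bytes, start_tag: bytes = b"JucK",
                end_tag: bytes = b"F.Zh"):
    """Return (start, end) offsets of the data between the two markers."""
    start = blob.find(start_tag)
    if start < 0:
        return None
    start += len(start_tag)  # assumption: the marker itself is not encoded
    end = blob.find(end_tag, start)
    if end < 0:
        return None
    return start, end


def xor_countdown(data: bytes, counter: int) -> bytes:
    """XOR each byte with the low byte of a counter that counts down."""
    out = bytearray()
    for byte in data:
        out.append(byte ^ (counter & 0xFF))
        counter -= 1
    return bytes(out)


# Round trip on made-up data: encoding and decoding are the same operation.
plain = b"This program cannot be run in DOS mode."
encoded = xor_countdown(plain, len(plain))
blob = b"....JucK" + encoded + b"F.Zh...."
start, end = find_region(blob)
print(xor_countdown(blob[start:end], end - start))
```

If the output looks like noise, try other starting values for the counter before concluding that the routine is different.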
We have a broken PE file which is likely dropped as c:\-.exe. Knowing the value of ebx before (iebx) and after (febx) decryption, you can dump the memory section: dump [iebx, febx - 1]. Now we open the malware in a hex editor (e.g. HexWorkshop) side-by-side with a well-formed exe. You will notice that the first readable character “@” appears at offset 18h in most exe files. Our malware has it at 14h. Also, there is no MZ at the beginning, so we need to pad the beginning of the malware and add the MZ:
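The same fix can be scripted. The sketch below is a rough Python illustration that prepends MZ plus enough filler to move the first “@” to offset 18h. Using zeros as filler is my assumption. As far as I know the loader only cares about the MZ magic and the offset to the PE header in the DOS header, but copying the bytes from a well-formed exe is the safer choice. patch_header is a hypothetical helper and the dump is fabricated from a typical DOS header.

```python
def patch_header(dump: bytes, expected_at: int = 0x18,
                 marker: bytes = b"@") -> bytes:
    """Prepend 'MZ' and filler so that marker ends up at expected_at."""
    found_at = dump.find(marker)
    if found_at < 0:
        raise ValueError("marker not found in dump")
    missing = expected_at - found_at
    if missing < 2:
        raise ValueError("no room for the MZ magic; header layout differs")
    # Assumption: zero filler is acceptable for the bytes right after 'MZ'.
    return b"MZ" + b"\x00" * (missing - 2) + dump


# Fabricated dump: a typical DOS header with its first four bytes missing,
# which leaves '@' at offset 0x14.
dump = bytes.fromhex("0300000004000000ffff0000b800000000000000") + b"@\x00"
patched = patch_header(dump)
print(patched[:2], hex(patched.find(b"@")))
```

This relies on the first 0x40 byte in the dump being the one from the header, so check the result in the hex editor before loading it anywhere.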
With this simple correction I got a malware sample with the signature b7a7d58de7f0399f0a14a594df384fc9 that is recognised by tools like IDA and PE Explorer. The malware is pretty simple and the imports and C2 seem to be decoded (some strings):
- ServerAddress=http[:]//210[.]211[.]31[.]214/ddradmin/ddrh.ashx -> C2!!!!
- Processors :
- Group name : %s
- Comments : %s
- Members :
- System Information:
- Last Update Patches:
- User Information and administrators group:
- Network Resources:
- Installed Applications:
- Installed Services:
- Browser Information:
- DNS server(s): %s
- Physical Address: %02x-%02x-%02x-%02x-%02x-%02x
- Host Name: %s
In this case, I managed to stumble upon a simple XOR routine. I was able to skip the middle calls without affecting the final binary. Also, the binary was not broken beyond repair so a simple comparison with a standard header solved the problem. In more complex cases, it is better to simulate a vulnerable environment and debug the exploited application. I tried to run this code using scDbg but could not see anything meaningful happening.
In this post I attempted to tackle common difficulties when analysing malicious PDFs carrying exploit code. It is my understanding that the show stoppers for the analysis are:
- What is exploited? Versions? CVE?
- What does the shellcode do? Contacted domains? Host indicators?
- If the file is malicious, it is likely that OSINT can help you
- You can analyse it manually and search for version checks
- You can leverage jsunpack-n and see if it can detect the CVE
As for shellcode analysis, the simplest way is either to manually hook the typically used unescape (e.g. print before/after the call) or to use tools like jsunpack-n to get it for you. Once you have the binary shellcode, load it in an emulator or debugger. If the former does not work, and since shellcode typically has extra code that assumes a proper exploitation environment, your best bet is to open it in IDA and try to locate primitives like PUSHA, LOOP (and derivatives), JMP, PUSH EBP, etc. Also, seemingly random mov's of hex constants before calls to functions may indicate API hashing. In the end it is all about looking for what stands out. Patience is key here.
Stay safe 😉