(Not) All She Wrote (Part 1): Rigged PDFs

Hello paranoids

 You know that moment when you get an email with an attached document and a promise of good fortune once you open it? Great, the following set of posts is for you. Lately I have been trying to learn how to speed up the analysis of malicious documents (e.g. PDF, Office, RTF). Because documents are so flexible, a thorough explanation of all their tags and mechanisms would be the subject of multiple posts, so I will keep the 101s simple. Also, this article focuses on analysing PDFs that exploit CVEs. There are multiple ways to deliver malice using PDFs:

  • Exploits
  • URLs for malicious domains (e.g. phishing)
  • Unsafe by design (e.g. launching cmd.exe by clicking somewhere in the document)

PDFs 101

 PDF documents are essentially a bunch of objects and tags delimited by %PDF-[Specification Number] at the top and %%EOF at the bottom. Those tags typically live inside objects so, if you open a PDF in a text editor, you will see something like:

PDF format - objects, dictionaries and references

 In the example above, “/Pages 4 0 R” says that “/Pages” references “obj 4 0”. The first number is the object id while the second is the generation number, which acts as a version (updated documents -> new versions). From a reversing/analysis perspective we need to focus on red-flag tags. Lenny Zeltser from SANS keeps a cheat sheet of the tags you should look out for. In the analysis section I will explain how to list the interesting tags in the document using some tools. As you can see, it is possible to embed JavaScript within PDF documents. Following this idea, there is another concept that I must introduce before moving on: streams. Streams are a means to store miscellaneous data such as images, strings, etc. Encoding/compression algorithms can be applied to the data stored in those streams. Malicious actors use this to obfuscate strings and JavaScript:

Flate Encoded Data Stream

  There are multiple algorithms to encode data (e.g. Flate, LZW). The tag that indicates the encoding is /Filter (e.g. /FlateDecode, /LZWDecode). When analysing PDFs containing exploits, the idea is to focus primarily on the JavaScript within. Also, some exploits rely on media such as images, SWF files, etc. Those will either be embedded within the document or stored as bytes in some JS variable. Armed with this basic knowledge of PDF structure, it is time to get our hands dirty.
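
 To make the stream idea concrete, here is a minimal Python sketch of what the decoding tools do for the Flate case: find every stream body and try to inflate it with zlib. It is only an illustration under simplifying assumptions (no object streams, no chained filters, no encryption); the function name inflate_streams and the fake object are mine, not part of any tool mentioned in this post.

```python
import re
import zlib

# Body of a stream object: everything between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield (offset, decoded_bytes) for every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the EOL bytes left before "endstream".
            decoded = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate data (or another filter is in use); skip it.
            continue
        yield match.start(), decoded


# Tiny fake object so the sketch runs without a sample on disk.
payload = b"app.alert('hello');"
fake_pdf = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
            + zlib.compress(payload) + b"\nendstream\nendobj\n")

for offset, data in inflate_streams(fake_pdf):
    print(offset, data)
```

 For real samples the dedicated tools are still the better choice, since they follow the /Filter entry instead of guessing.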

The Tools and the Approach

 If you read across the internet and some books (e.g. Malware Analyst's Cookbook), the procedure to analyse these documents seems trivial and goes along the following lines:

  1. Leverage a tool such as pdfid.py or peepdf.py to perform some recon on the documents (e.g. suspicious tags, potential JavaScript code). Look for risky tags and scripts, basically.
  2. Decode streams using pdf-parser.py, pdf.py (part of jsunpack-n), pdftk or pdfextract.
  3. Copy the resulting JavaScript (if there is any) to another file. To extract Flash, you can leverage the tools from step 2 or swf_mastah.py.
  4. From then on use SpiderMonkey and Didier Stevens’ extension, jsunpack-n (not an engine) or whatever JavaScript engines you can find. For SWF files use a tool such as SWFTools’ swfdump to obtain the ActionScript.
  5. If the document is rigged to exploit a CVE, extract the shellcode and triage it using sctest/scDbg or turn the shellcode into a PE file using shellcode2exe. Leverage a debugger for the latter. You can extract the shellcode manually or, if possible, use jsunpack-n.

 Steps 1-3 take some time but are straightforward. The same applies to step 5. However, step 4 is a pain due to the following factors:

  • No debuggers for PDF files containing JavaScript: I have tried the Adobe Pro debugger and it is basically a dead-end (i.e. unusable, limited, weird).
  • SpiderMonkey is Firefox’s JavaScript engine. Combined with Didier Stevens’ extension you can dump evals and document writes. However, if the JavaScript code uses objects provided by the PDF engine, such as app, you are out of luck. Exploit code tends to fingerprint the software rendering it and attempts an exploitation tailored for that version, which explains the usage of app. jsunpack-n is able to emulate some objects provided by the PDF engine most of the time.
  • I have found that complex constructions such as:
Unknown variable s

lead to errors where variables are not known. In the example above, when I used SpiderMonkey, with or without D. Stevens’ extension, I got a persistent error indicating that s is not known within the context of yo.

 In conclusion, the number of tools we have to parse PDFs and analyse them statically is more than enough, but existing analysis guides tend to be pretty optimistic and skip lots of inconvenient cases. While I can’t describe every case, the analysis of the documents below, together with some notes and tips, should be enough for you to start analysing malicious PDF documents. The cumbersome part tends to be the shellcode, for reasons discussed at the end of this post. I will be using REMnux for the analysis so some commands are shortened (e.g. python pdfid.py -> pdfid).

Analysis: 32c83ebbca9cbfe298ad270fa50bc5d3

 If you open the document within an isolated environment you will notice that it contains a bunch of links to what seem to be legitimate domains. We need to look deeper.

First, we enumerate:

 Peepdf tells us that the object with id 90 has JavaScript. The first image shows you the count for tags that are considered suspicious. From here you can handle this in one of two ways. The first is to keep doing recon:

pdf-parser -r [Object ID] [PDF Name] # Find all objects referencing the given object.
pdf-parser -s [Expression] [PDF Name] # Search for a term within any object, e.g. JavaScript, JS
pdf-parser -o [Object ID] -f -w [PDF File Name] > [Decoded File Out] # Decode the given object. -f passes the stream through its filters (i.e. decodes it) while -w produces raw output.

or, as I prefer, you can use:

pdfextract -j [PDF File Name] -o [Output Dir]

to get all the embedded JavaScript. If you dump all the streams you will see that the JS is indeed in object 90.
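
 If you are curious about what the tag-counting part of the enumeration boils down to, the following Python sketch approximates it: scan the raw bytes for PDF names, undo #xx hex escapes (a common trick to hide /JavaScript) and count the red flags. The list of names is an illustrative subset that I picked, and the function names are hypothetical. Names hidden inside compressed object streams will not be seen by something this simple.

```python
import re
from collections import Counter

# Illustrative subset of names usually treated as red flags during triage.
SUSPICIOUS_NAMES = frozenset([
    b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
    b"/Launch", b"/EmbeddedFile", b"/RichMedia", b"/XFA",
])

# A PDF name: "/" followed by anything that is not whitespace or a delimiter.
NAME_RE = re.compile(rb"/[^\s/<>\[\]()%{}]+")


def decode_name(token):
    """Undo #xx hex escapes, e.g. /J#61vaScript -> /JavaScript."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), token)


def count_suspicious_names(pdf_bytes):
    """Count occurrences of the red-flag names in raw (uncompressed) PDF bytes."""
    counts = Counter()
    for token in NAME_RE.findall(pdf_bytes):
        name = decode_name(token)
        if name in SUSPICIOUS_NAMES:
            counts[name.decode("ascii")] += 1
    return counts


sample = (b"1 0 obj << /Type /Catalog /OpenAction 90 0 R >> endobj\n"
          b"90 0 obj << /S /J#61vaScript /JS (app.alert(1)) >> endobj")
print(dict(count_suspicious_names(sample)))
```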

 Copy the whole script, strip out the Adobe tags (e.g. template xmlns…), use js-beautify to indent the code and push all the functions to the top, leaving this sequence at the bottom:

Loose code

 One common trait of obfuscated scripts is the existence of a function responsible for decoding internal strings. The pattern is typically the same: calls to a function with a random set of characters as one of the arguments. In this case the offender is fW1X662. Now we extract all the calls, we put them inside print and we put that inside a JS file together with the functions we pushed to the top. This should do part of the trick:

# Wrap every call to the decoding function in a print() statement.
cat [JS File] | grep -E -o "fW1X662[ ]*\('[^)]+\)" | while read -r line; do
	echo "print(\"$line->\" + $line );"
done

You should have something like:

Forced Decoding

 Notice the variables at the top. When you do selective execution of code it is likely that you will see the interpreter complaining about missing variables. An interesting variable here is xfa. If you try to execute the file we have built without it you will get an error: ReferenceError: xfa is not defined. This is where running PDF code in a plain JS engine fails. xfa is an object provided by Adobe to help interaction with forms. If you take a closer look at its usage within the code:

you see that this is likely a mechanism to detect an environment that is not a PDF engine. It just checks whether xfa is present. Simply add xfa = true; to the file. Execute the file with SpiderMonkey (i.e. js) or Windows cscript/wscript, get the results and replace the calls in the original file with the decoded strings. Typically this would make the script simpler but in this case not really 🙂
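
 The grep loop and the xfa stub can also be produced in one go. Below is a small Python sketch under the same assumptions as the shell version: the decoder is called with a quoted first argument and no nested parentheses. The function name build_decoder_harness is hypothetical. The output still has to be appended to the functions you pushed to the top before running it in a JS engine.

```python
import re


def build_decoder_harness(js_source, decoder_name, prelude="xfa = true;"):
    """Return JS that prints every literal call to decoder_name with its result."""
    # Same idea as the grep pattern: name, optional spaces, a quoted first
    # argument, then everything up to the first closing parenthesis.
    pattern = re.compile(re.escape(decoder_name) + r"[ ]*\('[^)]+\)")
    calls = []
    for call in pattern.findall(js_source):
        if call not in calls:  # keep first-seen order, drop duplicates
            calls.append(call)
    lines = [prelude]
    for call in calls:
        # Escape the call text so it can sit inside a double-quoted JS string.
        label = call.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('print("%s->" + %s);' % (label, call))
    return "\n".join(lines)


js = "a = fW1X662('x1y2', 3); b = fW1X662 ('zz'); c = fW1X662('x1y2', 3);"
print(build_decoder_harness(js, "fW1X662"))
```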

 Moving forward, we know how relevant the software version check is and, in Adobe's engine, the app object is used for that purpose. Check the following:

 As you can see, a massive if block is executed if the reader version is above or and then you have blocks for version < and < Now that we know the cases, we want some shellcode. We have to trick the code into thinking it is executing within a vulnerable environment. Furthermore, we are interested in the variable nee (what does it hold at the end?). Changes:

app override

 Test with 8.2 and then with the other one. Once you run this again, you will get the error “ace is not defined” (search for ace[cz] = yj;). This is one of the last lines of f0G32YR. ace is an object associated with the Adobe Color Engine. For some exploits there are at least two phases:

  • Filling memory with NOP sleds and shellcode
  • Triggering the vulnerability through the JavaScript

 The ace part is the vulnerability trigger while nee holds the shellcode. Comment out ace. This sample is tricky since there is no visible variable with a large chunk of shellcode, as in some other samples. In this case multiple function calls are performed in a loop to generate a massive amount of shellcode:

shellcode generation

 This loop will fill nee with multiple chunks of repeated shellcode to increase the likelihood of the execution falling within shellcode.  We cannot dump this shellcode at the end since it is massive and contains non-important stuff. Shellcode is typically managed by scripts in escaped Unicode (e.g. 0x9090 becomes %u9090). The shellcode is “unescaped” using the unescape function. Searching for it yields:



 The print was added by me. You will get two prints: the first is the shellcode and the second a tiny nop-sled. Copy the big chunk of shellcode to a file. 
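
 For reference, turning the printed %uXXXX string into raw bytes is a small transformation. The Python sketch below handles only the %uXXXX form and ignores everything else in the input (quotes, whitespace, concatenation operators); each 16-bit value is written little-endian, which is how it ends up in memory on x86. The function name percent_u_to_bytes is mine.

```python
import re
import struct


def percent_u_to_bytes(text):
    """Convert %uXXXX sequences to raw bytes (little-endian 16-bit words)."""
    words = re.findall(r"%u([0-9A-Fa-f]{4})", text)
    return b"".join(struct.pack("<H", int(word, 16)) for word in words)


# %u9090 is two NOPs; %uc3cc becomes the bytes CC C3 (int3, ret).
print(percent_u_to_bytes("%u9090%uc3cc").hex())
```

 Writing the result to a file in binary mode gives you something that can be fed to an emulator or to shellcode2exe.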

Now we can process/analyse the shellcode in one of two ways:

  • Static: ConvertShellcode or rasm can be used to disassemble shellcode provided as a string. Shellcode tends to resolve its dependencies at runtime so plain shellcode may be tricky to analyse. The absence of plain strings (there are decoding routines) makes this sample even harder to analyse using such an approach.
  • Dynamic: shellcode2exe.py + debugger, libemu sctest (shellcode emulator) or scDbg. The latter tends to work better since some shellcode has ROP garbage which is “ignored” by scDbg but not by sctest. If you want a quick answer, emulators are your best bet. You can also leverage the Olly Advanced plugin to load raw bytes of shellcode from a file into a memory region of “your choice”. You then have to set the EIP to the beginning of the shellcode in memory and step through from there.

 Also, REMnux has a couple of converters (e.g. unicode2raw, unicode2hex-escaped). Since the output of the previous script was Unicode-escaped, we can do:

# Flags:
# -S -> read from stdin
# -s -> number of steps
# -v -> verbose

      cat [script output] | unicode2raw | sctest -Svs 1000000000

However, the output of sctest looks like this:

Failed sctest

 While the outputted warnings/strings are normal, the fact that only one step took place is indicative that something failed.  Using scDbg:

scDbg output

  Assuming both emulators fail, you need to debug the code. You can load the shellcode bytes on the debugger or convert raw shellcode to an executable:

          python shellcode2exe.py [Raw Shellcode]

Load the resulting exe in OllyDbg or WinDbg. You should see this as the entrypoint:


 After multiple “Access Violations”, you will realise that the entrypoint is here. The remaining code may be designed to work in an ideal exploitation environment (e.g. within Acrobat Engine). All API resolutions use API hashing. You should observe the following chain of events:

  1. Shellcode and some string decoding (e.g. URL http[:]//ladh[.]cz[.]cc/w/l.php?x=1001)
  2. Loads urlmon and resolves URLDownloadToCacheFileA. Leverages the latter to access the URL and download the malware.
  3. Resolves CreateFileA and opens the cached file stored on the path provided by URLDownloadToCacheFileA.
  4. Resolves and uses memory mapping functions (e.g. CreateFileMappingA, MapViewOfFile, UnmapViewOfFile) to map the downloaded malware in memory. FlushViewOfFile is used to flush any changes to the original file. The reason for this is not clear since, using a test binary, nothing changed. It is my understanding that URLDownloadToCacheFileA creates a temporary file and that, by combining CreateFileA with the memory mapping functions, the file is persisted (guessing here).
  5. Loads shell32, resolves ShellExecuteA and runs the downloaded file.
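
 A note on the API hashing mentioned above: the usual way to deal with it is to hash a list of candidate API names yourself and look up the constants you see in the disassembly. The Python sketch below uses the ROR-13 additive hash, which is common in public shellcode. That choice is an assumption for illustration only; I am not claiming it is the routine used by this sample, so swap hash_func for whatever the shellcode actually implements. The candidate list simply reuses the API names from the chain of events.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def ror13_hash(name):
    """ROR-13 additive hash over the ASCII bytes of an API name (assumed scheme)."""
    h = 0
    for byte in name.encode("ascii"):
        h = (ror32(h, 13) + byte) & 0xFFFFFFFF
    return h


# Candidate names taken from the behaviour observed in the debugger.
API_NAMES = [
    "URLDownloadToCacheFileA", "CreateFileA", "CreateFileMappingA",
    "MapViewOfFile", "UnmapViewOfFile", "FlushViewOfFile", "ShellExecuteA",
]


def build_lookup(names, hash_func=ror13_hash):
    """Map hash constant -> API name so constants in the code can be resolved."""
    return {hash_func(name): name for name in names}


table = build_lookup(API_NAMES)
for value, name in sorted(table.items()):
    print("0x%08x %s" % (value, name))
```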

 So nothing surprising here. Just an exploit leveraged to download and execute malware. While a thorough analysis is welcome, since it puts us in a position to tell which versions are affected, we can automate the shellcode extraction a bit. We will leverage jsunpack-n to extract the shellcode. jsunpack-n leverages Mozilla’s SpiderMonkey with some hooks (e.g. on unescape, as I mentioned earlier) to extract useful information such as eval’d strings, shellcode, etc. Let us see how it handles the PDF:

# Having the JS in a file.
# -V means very verbose.
# -f means fast evaluation. I recommend leaving this out so the analysis is deeper.
jsunpackn.py [JS Path] [-f] -V

Running the command should give you something like:

jsunpack output

 What jsunpack-n does in essence is to simulate internal variables (e.g. app) to trick the malicious JavaScript into thinking it is executing within an Adobe engine. Multiple version strings are tested sequentially. The shellcode was successfully extracted using the same unescape hooking. In this case, the tool was not able to tell what CVE this malicious PDF exploits. We saw that the versions accounted for were: 

  • >= 8 and <
  • >=8 and <

 Also, it seemed like the affected element was the Adobe Color Engine (ACE). VirusTotal comments refer to Pidief and Pdfka. While searching for these I did not find any reference to ACE. However, searching for ACE vulnerabilities yields:

  • CVE-2010-3622: Adobe Reader and Acrobat 9.x before 9.4, and 8.x before 8.2.5 on Windows and Mac OS X (memory corruption)
  • CVE-2011-0598: Adobe Reader and Acrobat 10.x before 10.0.1, 9.x before 9.4.2, and 8.x before 8.2.6 on Windows and Mac OS X (integer overflow)

  In any case, we know the versions. As I mentioned, with a couple of steps and tools you can extract the JavaScript and the shellcode, and understand what the latter does. You came here for some recipes, right? So, here it goes:

Problem1: What versions are exploited? What is the CVE?
Approach: If jsunpack-n fails and OSINT is not able to tell you based on the PDF hash, extract the JavaScript using one of the tools mentioned earlier and look for references to app. Decode strings if needed. Understand which engine objects are used in order to identify the CVE (e.g. util.printf() -> CVE-2008-2992).
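
The “engine object -> CVE” idea can be kept as a small lookup that you run over the decoded JavaScript. The sketch below is a starting point, not a detector: only the util.printf pairing comes from this post, the other three are well-known pairings I added as examples, and the function name cve_hints is hypothetical. A hit only tells you where to look; it does not prove which vulnerability is exploited.

```python
import re

# Acrobat JavaScript API -> CVE commonly associated with it.
API_HINTS = {
    "util.printf": "CVE-2008-2992",            # pairing mentioned in the text
    "Collab.collectEmailInfo": "CVE-2007-5659",  # added example
    "Collab.getIcon": "CVE-2009-0927",           # added example
    "media.newPlayer": "CVE-2009-4324",          # added example
}


def cve_hints(js_source):
    """Return (api, cve) pairs for every known API referenced in the script."""
    hits = []
    for api, cve in API_HINTS.items():
        obj, method = api.split(".")
        # Allow whitespace around the dot, as obfuscated code sometimes has.
        pattern = r"\b%s\s*\.\s*%s\b" % (re.escape(obj), re.escape(method))
        if re.search(pattern, js_source):
            hits.append((api, cve))
    return hits


print(cve_hints("var num = 12999999999999999999; util.printf('%45000f', num);"))
```

Obfuscated samples often build these names at runtime, so run this after the strings have been decoded.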

Problem2: What does the shellcode do? 
Approach: Try jsunpack-n and sctest/scDbg. sctest may fail if the shellcode contains instructions with ROP garbage. scDbg tends to be better since it attempts multiple execution paths and lets you choose which one to take. Worst case scenario, use shellcode2exe and debug the shellcode. Bear in mind that ROP code may hinder the analysis. Check the following section for an exotic example.

Problem3: What if it is not an exploit? What if it is just a document with some tricks exploiting design?
Approach: The approach for Problem1 solves part of this. Leverage D. Stevens’ tools and peepdf to perform some recon on the file. The commands provided in this post should suffice. Look for suspicious tags and understand the relations between objects. Look for embedded URLs, etc.

 As a general note, especially for JS interpreters and automated deobfuscators, sometimes you may need to tweak the scripts a bit so they run. This has to be done on a case-by-case basis.

Shellcode garbage and entrypoint

 While going through a couple of documents I realised that running the shellcode as extracted from the documents is no simple task. Shellcode used in exploits has added instructions that assume certain values in registers (e.g. addresses). Running the shellcode blindly will lead to access violations/crashes. In the case presented above, the core functionality started a couple of lines after the “initial entrypoint”. This is not always the case. In some cases, the shellcode will search for tags (typically byte sequences) in the loaded document. This approach can be used to decode and deploy embedded malware or to execute more shellcode. Take the example of 721601bdbec57cb103a9717eeef0bfca. This PDF exploits CVE-2010-1297, a vulnerability triggered through a crafted SWF file. If you extract the shellcode, you should have something like:

 A proper jmp, call or decoding loop is nowhere to be found. Running strings on the shellcode reveals a suspicious string: c:\-.exe. While wandering around the shellcode in IDA I spotted this:

Decoding Workflow
F.zh inside PDF
JucK inside PDF

 Basically, the shellcode is looking for two offsets, marked by JucK and F.Zh. Whatever is between those offsets is XORed with another value. The vertical arrows indicate normal execution. You should set the EIP to the instruction at the beginning of those arrows and execute the code until the end (i.e. arrow tip). For the other arrow, you must execute the instructions until the beginning of it, jump (set EIP) to the instruction at its tip and proceed from there (0x401323->(after) 0x401336).

 If you load the document somewhere in memory and follow the instructions above, you should be able to spot the following transformation (make sure you write down the value of ebx before and after the loop):


 The decoding trick employed by the shellcode is pretty clever: a byte stream is generated starting from the lower byte of the initial size of the section to be decoded (0xFC), which is decremented on each iteration of the loop. That byte stream is XORed against each byte of the malware.
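
 If you would rather replay the transformation outside the debugger, the routine as I read it can be sketched in Python. Treat every detail as an assumption to verify against the disassembly: that the key starts at 0xFC, that it wraps around at zero, that the markers themselves are excluded and the order in which they appear. The helper names carve_between and xor_decrementing_key are mine.

```python
def carve_between(blob, start_marker, end_marker):
    """Return the bytes between two markers, or None if either is missing."""
    start = blob.find(start_marker)
    if start < 0:
        return None
    start += len(start_marker)
    end = blob.find(end_marker, start)
    if end < 0:
        return None
    return blob[start:end]


def xor_decrementing_key(data, start_key=0xFC):
    """XOR each byte with a one-byte key that decrements (mod 256) every step."""
    out = bytearray(len(data))
    key = start_key & 0xFF
    for index, byte in enumerate(data):
        out[index] = byte ^ key
        key = (key - 1) & 0xFF
    return bytes(out)


# Synthetic example: encode a fake payload, wrap it in markers, recover it.
secret = b"This program cannot be run in DOS mode"
document = b"%PDF-junk" + b"JucK" + xor_decrementing_key(secret) + b"F.Zh" + b"junk"
carved = carve_between(document, b"JucK", b"F.Zh")
print(xor_decrementing_key(carved))
```

 Since XOR is its own inverse, the same function encodes and decodes as long as the starting key matches.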

 We have a broken PE file which is likely dropped as c:\-.exe. Knowing the value of ebx before (iebx) and after (febx) decryption, you can dump the memory section. Dump between [iebx, febx -1]. Now we open the malware in a hex editor (e.g. HexWorkshop) side-by-side with a well-formed exe. You will notice that the first readable character “@” appears at an offset of 18h in most exe files. Our malware has it at 14h. Also, there is no MZ at the beginning so we need to pad the beginning of the malware and add the MZ:

MZ Corrected

 With this simple correction I got a malware sample with the signature b7a7d58de7f0399f0a14a594df384fc9 that is recognised by tools like IDA and PE Explorer. The malware is pretty simple and the imports and C2 seem to be decoded (some strings):

  • [qmgrConfig]
  • ServerAddress=http[:]//210[.]211[.]31[.]214/ddradmin/ddrh.ashx -> C2!!!!
  • SleepTime=1000
  • Guid=00000000-0000-0000-0000-000000000000
  • VendorIdentifier
  • ProcessorNameString
  • Identifier
  • Processors :
  • Group name : %s
  • Comments : %s
  • Members :
  • System Information:
  • Last Update Patches:
  • User Information and administrators group:
  • Network Resources:
  • Installed Applications:
  • Installed Services:
  • Browser Information:
  • DNS server(s): %s
  • Physical Address: %02x-%02x-%02x-%02x-%02x-%02x
  • Host Name: %s


 In this case, I managed to stumble upon a simple XOR routine. I was able to skip the middle calls without affecting the final binary. Also, the binary was not broken beyond repair so a simple comparison with a standard header solved the problem. In more complex cases, it is better to simulate a vulnerable environment and debug the exploited application. I tried to run this code using scDbg but could not see anything meaningful happening.
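
 The header repair described in this section is easy to script once you know the two offsets. The Python sketch below uses the values from this sample (“@” seen at 14h, expected at 18h) as defaults. The padding bytes after MZ are zeros here, which is my assumption; copying them from the reference executable used in the side-by-side comparison is the safer option. The function name prepend_mz is hypothetical.

```python
def prepend_mz(dump, observed_offset=0x14, expected_offset=0x18):
    """Prepend 'MZ' plus padding so a known header byte lands where expected."""
    missing = expected_offset - observed_offset
    if missing < 2:
        raise ValueError("not enough room to add the MZ signature")
    # 'MZ' followed by zero padding (assumed values, see the note above).
    return b"MZ" + b"\x00" * (missing - 2) + dump


# Fake dump: 0x14 filler bytes followed by the '@' that should sit at 0x18.
broken = b"\x00" * 0x14 + b"@" + b"rest-of-header"
fixed = prepend_mz(broken)
print(fixed[:2], hex(fixed.index(b"@")))
```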

Final Thoughts

 In this post I attempted to tackle common difficulties when analysing malicious PDFs carrying exploit code. It is my understanding that the show stoppers for the analysis are:

  • What is exploited? Versions? CVE?
  • What does the shellcode do? Contacted domains? Host indicators?

 As far as I know, JavaScript is always leveraged for heap spraying and (optionally) for triggering the exploit, so that the exploit is more likely to succeed. As such, a good starting point is always the extraction and analysis of the JS. CVE identification is not a big issue for multiple reasons:

  • If the file is malicious, it is likely that OSINT can help you
  • You can analyse it manually and search for version checks
  • You can leverage jsunpack-n and see if it can detect the CVE

  As for shellcode analysis, the simplest way is to either manually hook the typically used unescape (e.g. print before/after the call) or use tools like jsunpack-n to get it for you. Once you have the binary shellcode, load it in an emulator or debugger. If the former does not work, and since shellcode typically has extra code that assumes a proper exploitation environment, your best bet is to open it in IDA and try to locate primitives like PUSHA, LOOP (and derivatives), JMP, PUSH EBP, etc. Also, the existence of random mov’s of hex strings before calls to functions may indicate API hashing. In the end it is all about looking for what stands out. Patience is key here.

Stay safe 😉

