Basic Document Analysis By Example: Sample 1


This article has three sub-goals around the primary goal of document analysis. The first is to trace through a PDF and extract the docm file the PDF drops. The second is to inspect the docm file using freely available tools. The final sub-goal is to generate yara rules that flag sample-identifying strings in Object Linking and Embedding (OLE) streams, which we can then apply to other samples.

Disclaimer: I do my best to analyze in a vacuum. It helps greatly with self-improvement. While I will do research on things like file extraction, I strive to avoid looking up anything directly related to the sample itself. For others who have written their own analysis, hopefully we arrived at the same conclusion, but I do welcome any and all feedback and corrections.


In this article we will cover PDF analysis using some command line tools. We will start with the basics and drill down to get the embedded file. While we can skip directly to the file with ease (automation example below), let's follow the trail through the PDF file and learn a few things on the way.

Sample: 0ae3329bd7d8a4f61b28508194139115176be6bc546a682663993f12683a00e2

PDF Analysis

Start by acquiring a general overview of the PDF using peepdf [1]. The goal here is to find a place to start the analysis by identifying suspicious elements.

peepdf <name_of_file>

Code Snippet 1 - peepdf document stats

Screen Shot 1 - peepdf document stats

Notice the suspicious elements listed at the bottom which we define here.


  • OpenAction: "A value specifying a destination to be displayed or an action to be performed when the document is opened." [2]
  • Names: "can also specify and launch scripts or actions." [3]
  • JS: "A text string or stream containing a JavaScript script to be executed when the action is triggered." [4]
  • JavaScript: "Causes a script to be compiled and executed by the JavaScript interpreter." [5]
  • EmbeddedFiles: "A name tree mapping name strings to file specifications for embedded file streams" [6]
  • EmbeddedFile: "An embedded file stream" [7]

The number in () is the count of objects found. The number in [] is the object number; think of it as the key to looking up the object for further inspection.
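As a rough illustration of how to read those entries, the sketch below parses one summary line into its element name, count, and object numbers. The sample line and the `parse_element_line` helper are assumptions based on the layout described above; this is not part of peepdf's API.

```python
import re

# Assumed layout of one summary line: /Name (count): [object ids]
ELEMENT_LINE = re.compile(r"^\s*(/\w+)\s*\((\d+)\):\s*\[([\d,\s]*)\]")


def parse_element_line(line):
    """Return (name, count, object_ids), or None if the line does not match."""
    match = ELEMENT_LINE.match(line)
    if match is None:
        return None
    name, count, ids = match.groups()
    object_ids = [int(part) for part in ids.split(",") if part.strip()]
    return name, int(count), object_ids


# Hypothetical sample line; object 14 is the /OpenAction object in this article.
print(parse_element_line("    /OpenAction (1): [14]"))
```

The object numbers returned here are the values to feed to peepdf's object and info commands.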

Let's start with the OpenAction, as this is executed when the document opens. Using peepdf's interactive mode, open the file and go to the /OpenAction object.

peepdf -i <name_of_file>
PPDF> object 14

Code Snippet 2 - peepdf interactive mode

Screen Shot 2 - peepdf interactive mode

We find JavaScript calling another JavaScript function called submarine().

Next search the document to find all references to submarine.

PPDF> search submarine

Code Snippet 3 - peepdf search for string

Screen Shot 3 - peepdf search for string

Notice this matches the Objects with JS Code from the file summary. As object 14 is the OpenAction object number, let's look at 6. First run the info command to confirm type.

PPDF> info 6

Code Snippet 4 - peepdf info on object 6

Screen Shot 4 - peepdf info on object 6

Notice JSCode is Yes, so use the following command:

PPDF> js_beautify object 6

Code Snippet 5 - peepdf render object 6 as javascript

This command will render the JavaScript in a format that is easier to read. I am not going to show the full JavaScript, but here is an interesting snippet:

Screen Shot 5 - peepdf render object 6 as javascript

Searching for exportDataObject on Adobe's help site [8], we learn that it will extract the document and, depending on the value of nLaunch, it will open it. Notice the value of dis is 2, thus the file will be saved to a temporary location and launched.
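To make the nLaunch behavior concrete, here is a minimal Python sketch that pulls cName and nLaunch out of an exportDataObject call and maps the value to its documented meaning. The `SAMPLE_JS` string is a hypothetical stand-in (the real script is not reproduced here), and `describe_export` only resolves a simple variable assignment such as dis, so treat it as an illustration rather than a JavaScript parser.

```python
import re

# nLaunch values as described for Doc.exportDataObject in the Acrobat JavaScript API.
NLAUNCH_MEANING = {
    0: "saved, not launched",
    1: "user is prompted for a save path, then the file is launched",
    2: "saved to a temporary path (no save prompt) and launched",
}

# Hypothetical stand-in for the beautified JavaScript in object 6.
SAMPLE_JS = 'var dis = 2; this.exportDataObject({cName: "EWLKGTB.docm", nLaunch: dis});'


def describe_export(js_source):
    """Return (cName, nLaunch, meaning) for the first exportDataObject call, or None."""
    call = re.search(
        r'exportDataObject\(\{\s*cName:\s*"([^"]+)",\s*nLaunch:\s*(\w+)', js_source)
    if call is None:
        return None
    name, launch = call.groups()
    if not launch.isdigit():
        # nLaunch is a variable name; look for a plain "<name> = <digits>" assignment.
        assignment = re.search(r"\b%s\s*=\s*(\d+)" % re.escape(launch), js_source)
        if assignment is None:
            return name, None, "unknown"
        launch = assignment.group(1)
    value = int(launch)
    return name, value, NLAUNCH_MEANING.get(value, "unknown")


print(describe_export(SAMPLE_JS))
```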

Next, locate this document in the pdf.

PPDF> search EWLKGTB.docm

Code Snippet 6 - peepdf search for EWLKGTB.docm

Screen Shot 6 - peepdf search for EWLKGTB.docm

This shows us the list of objects that contain that string. Object 6 is the one that we were just looking at, so let's look at 5 and 12.

PPDF> object 5

Code Snippet 7 - peepdf inspect object 5

Boom! On our first hit we find what we are looking for. According to the PDF specifications, /Filespec is "The dictionary form of file specification" [9]. Enumerating the fields, /EF holds the dictionary object whose value points to the stream of the embedded object.

Screen Shot 7 - peepdf inspect object 5

So object 4 should hold the embedded document.
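The same lookup can be sketched in Python: given the text of a /Filespec dictionary, find the indirect reference that /EF points to. The `FILESPEC` text below is a hypothetical dictionary shaped like object 5, and the regular expression assumes /F is the first key inside /EF, which will not hold for every PDF.

```python
import re

# Hypothetical text of a /Filespec dictionary, shaped like the one in object 5.
FILESPEC = "<< /Type /Filespec /F (EWLKGTB.docm) /EF << /F 4 0 R >> >>"


def embedded_stream_ref(filespec_text):
    """Return the object number that /EF -> /F refers to, or None if absent."""
    match = re.search(r"/EF\s*<<\s*/F\s+(\d+)\s+(\d+)\s+R", filespec_text)
    return int(match.group(1)) if match else None


print(embedded_stream_ref(FILESPEC))
```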

PPDF> info 4

Code Snippet 8 - peepdf info on object 4

Screen Shot 8 - peepdf info on object 4

Between the Size of the object and Object type of stream, this looks like it's the embedded file's stream.

Notice: [4] corresponds to the embedded file object number seen in the file summary.

Inspecting the stream directly, we can see the hex values.

PPDF> stream 4

Code Snippet 9 - peepdf render object 4 as a stream

Screen Shot 9 - peepdf render object 4 as a stream

The signature [10] at the start looks like a Microsoft Office Word Open XML formatted document, which (conveniently enough) corresponds to the docm extension.
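A signature check like this is easy to script. The sketch below (Python 3) tests for the ZIP local file header that Open XML documents start with, then looks for word/vbaProject.bin, the part where a macro-enabled Word document conventionally keeps its VBA project. The helper name `looks_like_macro_docx` is my own, and the check is a heuristic, not a full format validation.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature shared by ZIP-based formats


def looks_like_macro_docx(data):
    """Return True if data is a ZIP container that holds word/vbaProject.bin."""
    if not data.startswith(ZIP_MAGIC):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "word/vbaProject.bin" in archive.namelist()
    except zipfile.BadZipFile:
        # The header matched but the rest of the container is not a valid ZIP.
        return False
```

Once the embedded file has been written to disk, it can be checked with `looks_like_macro_docx(open("EWLKGTB.docm", "rb").read())`.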

Switching tools, extract the stream to a file using pdf-parser [11].

pdf-parser.py -o 4 -f -d - <name_of_file> > EWLKGTB.docm

Code Snippet 10 - pdf-parser extract embedded file

Breaking this command down we have:

  • -o 4: Object Flag + Object Number
  • -f: Applies the filter for the Object type. It does this automagically, provided the tool has support for the underlying stream type. Object 4 in this case uses FlateDecode, which pdf-parser supports.
  • -d -: Dump the stream. The hyphen after the -d says to dump it to stdout (the console screen in this case).
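For intuition about what the -f step does here: FlateDecode is zlib/deflate compression, so the filter amounts to inflating the raw stream bytes. A minimal sketch, using made-up data in place of the real stream and ignoring predictors and chained filters:

```python
import zlib


def flate_decode(raw_stream):
    """Inflate the raw bytes of a /FlateDecode stream (no predictor handling)."""
    return zlib.decompress(raw_stream)


# Made-up data standing in for the bytes between "stream" and "endstream".
raw = zlib.compress(b"PK\x03\x04 pretend docm bytes")
print(flate_decode(raw)[:4])
```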

We take the output stream and redirect it to a file and confirm the file type using file.

file EWLKGTB.docm

Code Snippet 11 - Using file, inspect the header for the file type.

Screen Shot 10 - Using file, inspect the header for the file type.

Cool, we now have an Office doc to look at!

Before looking at the Office document, we are going to cover how to automate locating the embedded object's number and extracting the corresponding stream.

Automating Embedded File Extraction

As fun as it is to extract embedded files manually, we can hook into these two libraries to create a script to automate the extraction.

The peepdf library is relatively easy to use. The code snippet below will get the object number for each /EmbeddedFile element found in each version.

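  from peepdf.PDFCore import PDFParser

  pdf_parser = PDFParser()
  ret, pdf = pdf_parser.parse(self._file, True, False, False)
  stats_dict = pdf.getStats()
  embedded_files = []
  for version in stats_dict.get('Versions'):
    # Assumption: the body of this loop is missing from the snippet. peepdf is
    # assumed to list /EmbeddedFile object ids under each version's 'Elements'.
    ids = (version.get('Elements') or {}).get('/EmbeddedFile')
    if ids:
      embedded_files.extend(ids)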

Code Snippet 12 - Hooking peepdf to get list of embedded documents.

Hooking into pdf-parser.py requires a little more work.

First, download pdf-parser.py into the same directory as the automation script. Make sure to rename it as pdfparser.py, as the hyphen makes it a challenge to import.

try:
  from pdfparser import cPDFParser, PrintObject, PDF_ELEMENT_INDIRECT_OBJECT
except Exception as ex:
  print("Didier Stevens' pdf-parser.py can't be found.")
  print("")
  print("  Remember to rename pdf-parser.py to pdfparser.py so it can be imported.")

Code Snippet 13 - Importing pdf-parser

Next we need to create a class to emulate the options object created from optparse. We only need the options our code will call, so no need to create a property for each.

class OptParseEmulator(object):
  """
  Summary: Create an object with the parameters pdf-parser expects for
      extracting embedded documents.

  filter = [True|False]
  generate = [True|False]
  verbose = [True|False]
  extract = filename to extract malformed content to
  object = id of indirect object to select (version independent)
  dump = filename to dump stream content to
  nocanonicalizedoutput = [True|False]
  debug = [True|False]
  hash = [True|False]
  content = [True|False]
  """

  def __init__(self, _filter, generate, verbose, extract, _object, dump,
               nocanonicalizedoutput, debug, _hash, content):
    self.filter = _filter
    self.generate = generate
    self.verbose = verbose
    self.extract = extract
    self.object = _object
    self.dump = dump
    self.nocanonicalizedoutput = nocanonicalizedoutput
    self.debug = debug
    self.hash = _hash
    self.content = content

Code Snippet 14 - Emulating pdf-parser's optparse object
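As an alternative sketch, the standard library's optparse.Values class already behaves like the object optparse hands back: it turns a dictionary into attributes. The helper name `make_options` and the idea of swapping it in are my own assumptions. It carries only the option names listed above, so a pdf-parser version that reads other options would need them added.

```python
from optparse import Values


def make_options(object_id, dump_path):
    """Build an options object carrying the attributes listed above."""
    return Values({
        "filter": True,
        "generate": False,
        "verbose": True,
        "extract": None,
        "object": str(object_id),  # pdf-parser takes the object id as a string
        "dump": dump_path,
        "nocanonicalizedoutput": False,
        "debug": False,
        "hash": False,
        "content": False,
    })


options = make_options(4, "extracted_4.file")
print(options.object, options.dump)
```

The hand-written class makes the expected attributes explicit, while Values keeps the option names next to their values at the call site.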

Next we can wrap the calls into a function. The function's code (below) mimics how it's written in pdf-parser's main method with the other options removed.

def embedded_extraction(self, options):
  """Lifted just what was needed from pdf-parser.py to extract the file."""
  oPDFParser = cPDFParser(self._file, options.verbose, options.generate)
  while True:
    _object = oPDFParser.GetObject()
    if _object is None:
      # No more objects to read; without this check the loop never ends
      break
    if _object.type == PDF_ELEMENT_INDIRECT_OBJECT:
      # options.object holds the object id as a string
      if _object.id == int(options.object):
        PrintObject(_object, options)

Code Snippet 15 - Calling pdf-parser's document extraction logic

Ok, now we are ready to call the code. Remember to pass in the object's id using an instance of OptParseEmulator.

for embedded_id in embedded_files:
  # Emulate the options passed to Didier Stevens' pdf-parser.py
  options = OptParseEmulator(True, False, True, None, str(embedded_id),
                             'extracted_'+str(embedded_id)+'.file', False,
                             False, False, False)
  # Extract file
  self.embedded_extraction(options)

Code Snippet 16 - Setup complete - Extract embedded documents.

A full example, written as a command line tool, can be found here. It's only been tested with a handful of samples, so no guarantee to work across all PDFs.

Word Document Analysis

Let's move on to analyzing the Word document. The tools we are going to use are:

  • olevba, part of oletools [12]
  • oledump [13]

Run olevba (part of oletools) to extract any VBA code along with a summary of suspicious items found in the sample. Below is the summary from EWLKGTB.docm.

olevba EWLKGTB.docm

Code Snippet 17 - Inspecting document with olevba

Screen Shot 11 - olevba summary

Another popular tool is oledump. Running the tool against EWLKGTB.docm, we can see the different OLE objects. The ones with an m or M next to them are macros.

oledump.py EWLKGTB.docm

Code Snippet 18 - Inspecting document with oledump

Screen Shot 12 - oledump summary

We are not going to go through the code line by line, but we will summarize it and create a yara rule to help identify the document from its VBA stream.

VBA Analysis


Reviewing the code shows that it downloads a file from one of four URLs and saves it as ratchet20.exe under the temp directory. Then, using shell.Application, it opens the exe.

The VBA code is also obfuscated by using text, tag, and caption properties of items on a VBA form. For example, it extracts the value from a combobox (ffrrggbb) tag property and splits it into an array.

AsStringName = Split(Window1.ffrrggbb.Tag, "LACHET")
Vaucher = AsStringName(FreshID + FreshID * 2 / 13)

Code Snippet 19 - VBA snippet parsing some strings for use later

Screen Shot 13 - Parsing the string using Python
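The VBA lines above can be mimicked with a few lines of Python 3. The tag value and the FreshID values below are hypothetical, since the real values live on the form and in the macro. The rounding step assumes VBA rounds a fractional array index to the nearest integer (ties to even), which is also how Python 3's round() behaves.

```python
def vba_split_pick(tag, fresh_id, delimiter="LACHET"):
    """Mimic Split(tag, delimiter)(FreshID + FreshID * 2 / 13)."""
    parts = tag.split(delimiter)
    # VBA's "/" produces a Double; assumption: the fractional index is then
    # rounded to the nearest integer before it is used as a subscript.
    index = round(fresh_id + fresh_id * 2 / 13)
    return parts[index]


# Hypothetical tag value, not the string stored in the sample.
tag = "alphaLACHETbetaLACHETgamma"
for fresh_id in (0, 1, 2):
    print(fresh_id, vba_split_pick(tag, fresh_id))
```

For small FreshID values the extra FreshID * 2 / 13 term rounds away, so the expression is mostly noise meant to obscure a plain array index.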

We can still make out its purpose even with the obfuscation.

Purpose: A downloader that drops a file named ratchet20.exe to %TEMP%.

Extracting the URLs

To extract the URLs, we can use oledump to dump all of the streams (-s a), render the VBA code (-v), and look for the line containing outb. I happened to see the URLs when reviewing the VBA code, and that's why we are filtering on outb.

oledump.py -s a -v <name_of_file> | grep -i outb

Code Snippet 20 - Filtering down to the line containing the URLs

Screen Shot 14 - Filtering down to the line containing the URLs

I suspect that the letter V is what is being used to split the URLs. We can confirm this by looking at it in Word.

Screen Shot 15 - Checking the value of Window1.Command.Caption

Taking this knowledge, parse the URLs.

Screen Shot 16 - Extracting the URLs using bash cmd tools

PowerShell example:

Screen Shot 17 - Extracting the URLs using PowerShell on Linux
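The same split can be sketched in Python. The caption below is a hypothetical placeholder with made-up hosts, not the sample's real value; the only thing carried over from the analysis is that the letter V separates the URLs.

```python
def split_caption(caption, delimiter="V"):
    """Split the caption on the delimiter and keep the non-empty pieces."""
    return [piece for piece in caption.split(delimiter) if piece]


# Hypothetical caption with placeholder hosts (none of them contain a capital V).
caption = "http://one.example/aVhttp://two.example/bVhttp://three.example/c"
for url in split_caption(caption):
    print(url)
```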

That's it! The URLs are extracted.

Create Yara Rules

Another feature of oledump allows us to use yara rules against OLE streams. Using the URLs, create a yara rule.

rule ewlkgtb_urls {
    meta:
        Author = "XOR Hex"
    strings:
        $url1 = ""
        $url2 = ""
        $url3 = ""
        $url4 = ""
    condition:
        1 of ($url1, $url2, $url3, $url4)
}

Code Snippet 21 - Yara rule to search for the URLs

Upon running it we see:

Screen Shot 18 - Yara rule results

Simple enough. Let's add two more rules to see where the user-agent value and dropper file name are located.

rule ewlkgtb_urls {
    meta:
        Author = "XOR Hex"
    strings:
        $url1 = ""
        $url2 = ""
        $url3 = ""
        $url4 = ""
    condition:
        1 of ($url1, $url2, $url3, $url4)
}

rule ewlkgtb_useragent {
    meta:
        Author = "XOR Hex"
    strings:
        $useragent = "Mozilla/5.2 (Windows NT 6.2; rv:50.2) Gecko/20200103 Firefox/50.2"
    condition:
        $useragent
}

rule ewlkgtb_droppername {
    meta:
        Author = "XOR Hex"
    strings:
        $ratchet = "ratchet"
    condition:
        $ratchet
}

Code Snippet 22 - Yara rule to search for the URLs, user-agent, and the name of the downloaded file

Screen Shot 19 - Yara rule results

Both the user-agent value and the "save as" file name are stored on a VBA form (A15). The user-agent is stored in the Window1.SpinButton1.Tag1 property and the file name is parsed out of Window1.ffrrggbb.Tag (a combobox).
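Conceptually, running these rules against each stream is a per-stream substring search. The sketch below imitates that in plain Python so the matching logic is easy to see. The stream names and contents are made up, and `scan_streams` is a hypothetical helper: it only covers plain ASCII strings and is not a replacement for yara.

```python
def scan_streams(streams, indicators):
    """Map each indicator label to the sorted stream names that contain it."""
    hits = {}
    for label, text in indicators.items():
        needle = text.encode("ascii")
        hits[label] = sorted(name for name, data in streams.items()
                             if needle in data)
    return hits


# Made-up stream contents keyed by an oledump-style stream index.
streams = {
    "A3": b"Attribute VB_Name = \"Module1\"",
    "A15": b"...ratchet20...Mozilla/5.2 (Windows NT 6.2; rv:50.2) "
           b"Gecko/20200103 Firefox/50.2...",
}
indicators = {
    "useragent": "Mozilla/5.2 (Windows NT 6.2; rv:50.2) Gecko/20200103 Firefox/50.2",
    "droppername": "ratchet",
}
print(scan_streams(streams, indicators))
```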


In today's article we reviewed how to:

  • Follow a PDF's OpenAction
  • Extract an embedded file
  • Automate the embedded file extraction
  • Use tools to look into malicious documents
  • Create yara rules to locate identifying strings inside of OLE streams

  1. peepdf
  2. Adobe Help Live Docs
  3. Analyzing Malicious Documents
  4. Adobe Help Live Docs
  5. Adobe Help Live Docs
  6. Adobe Help Live Docs
  7. Adobe Help Live Docs
  8. Adobe PDF Reference
  9. Adobe PDF Reference
  10. File Extension Seeker
  11. pdf-parser
  12. oletools
  13. oledump
