Metadata: Perception vs. Truth

Intel Operator
6 min read · Jul 8, 2021
Photo Credit — @heyerlein (The perception of a form via smaller artefacts)

One small step to reduce your online exposure.

Understanding the basics

File-based metadata is defined as the attributes assigned to a file that describe its characteristics, excluding the content of the file itself. File metadata enables file characterisation and is used by software, databases and other systems.

When a file is written to disk, attributes like usernames, the software that created the file and timestamps are appended to its metadata. At scale, these attributes can identify internal systems, naming conventions and software versions with known exploits, which red teams or malicious adversaries can leverage if the file is recent enough.

In my forensic investigation experience, actors have also opened files with the credentials they obtained, edited the files for whatever reason and resaved them; the resulting metadata changes supported hypotheses about what types of data the actors were interested in.

A user’s first and last name, software used, location of the file, hash etc., from a single PDF using ExifTool.
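For reference, the same attributes can be pulled from the terminal. A minimal sketch, assuming a hypothetical report.pdf:

#inspect common identifying attributes in a single PDF (file name hypothetical)
exiftool -Author -Creator -Producer -CreateDate -ModifyDate report.pdf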

Real-world cases

Security researchers, pentest teams or red teams download, parse and analyse breached data from various sources for risk classification and exposure association. This data contains invaluable intelligence that identifies:

  • People
  • Software; and
  • Infrastructure.

This data can also be used by malicious actors, or it can be used to classify the risk created by the intentional or unintentional exposure of information.

One document or ten documents taken at face value ordinarily don't support correlations; ten thousand or one hundred thousand documents can create a unique picture that can be used against an individual or organisation.

Collecting metadata against a medium-to-large business can easily yield thousands of documents via automated tools.

An example OSINT collection process could look like: enumerate target domains and staff > discover published documents (search engine dorks, open buckets, breach data) > download the files > extract and correlate metadata.

Comparable processes can result in thousands of files awaiting analysis.

The below code snippet is a simple function (within a larger program) that calls GrayhatWarfare's API and searches for keywords so the discovered files can later be downloaded and parsed. This type of enumeration/collection is used by malicious attackers to gain awareness of their target (MITRE ATT&CK reference).

#reading wordlist, to download all found files
function download_b_ids {
  while read -r line; do
    clear
    echo "Initiating seeking routine..... Target is ${line}...... Seeking .... (+)"
    #first page of results (offset 0)
    curl --connect-timeout 15 --max-time 15 "https://buckets.grayhatwarfare.com/api/v1/files/${line}/0/1000?access_token=API" \
      | python -m json.tool | grep '"url":' \
      | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" >> "$TMPID"
    #calling the remaining buckets in 1000-record increments
    for a in {0..1000000..1000}; do
      b=$((a + 1000))
      echo "Downloading - ${a} / ${b} access points using ID: ${line}. Max 1000000 records tested"
      curl --connect-timeout 3 "https://buckets.grayhatwarfare.com/api/v1/files/${line}/${b}/1000?access_token=API" \
        | python -m json.tool | grep '"url":' \
        | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" >> "$TMPID"
      clear
    done
  #ending primary loop
  done <"$metakeys"
}
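A minimal invocation sketch, assuming hypothetical file names for the two variables the function expects:

#hypothetical setup: $metakeys holds one keyword per line, $TMPID collects the URLs
metakeys="keywords.txt"
TMPID="found_urls.txt"
download_b_ids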
The correlation of metadata can result in the identification of exploits (e.g., matching extracted software versions against Exploit-DB).

Metadata can also be obtained via data breaches. For example, a standard data breach processing flow could look like:

  • Deduplication of files
  • Identification of the victim and potential naming conventions
  • Extraction of metadata at scale (usually results in thousands of unique metadata attributes; a rough sketch follows below)
  • Creation/use of broad custom rules to understand potential impacts
  • Creation/use of specific rules to narrow in on confidential/impacting data
  • Correlation of metadata > individual > confidential exposure (without metadata, the correlation of confidential data to the user is complex in some cases); and
  • Breach reporting.
Example of compromised files hosted by an actor on Tor. It contains a wealth of metadata.
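Two of the steps above (deduplication and at-scale extraction) can be sketched in a few lines of bash. A rough illustration, assuming hypothetical downloads/ and unique/ directories and file names without spaces:

#deduplicate downloaded files by SHA-256, keeping one copy per unique hash
mkdir -p unique
sha256sum downloads/* | sort | awk '!seen[$1]++ {print $2}' | xargs -I{} cp {} unique/
#extract metadata at scale into one JSON file for later rule matching
exiftool -json -r unique/ > metadata.json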

Point to note — A usual OSINT investigation against/for a medium-sized organisation results in approx. 6,000 to 30,000 unique files when parsing approx. 400–1,000 users (excluding open buckets).

Example pre-visualisation of metadata automation.
Example Splunk search based on JSON-formatted metadata, ingested and visualised. Approx. 1,700 files (one domain, excluding subdomains) with 542 software versions identified.

Weaponisation

The following list has examples of how metadata can be weaponised:

  • Names — Collect information about individuals and pivot to lures (related to context correlation)
  • Emails — Naming conventions, e.g. FN.LN@comp or LN.MN.FI@comp (see the sketch after this list)
  • Software — Correlate software (and version) to available exploits
  • Drive mappings — Understanding the internal network, file stores or locations that store confidential data; and
  • A myriad of other actions.
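As an illustration of the email point above, a minimal sketch that applies an observed FN.LN convention to scraped names (file name and domain hypothetical):

#generate candidate emails from scraped names using the FN.LN convention
while read -r first last; do
  echo "${first,,}.${last,,}@company.com"
done < names.txt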

A simple correlation of names to emails enables social media scraping, generation of phishing lures, mapping of users/emails to known breaches/passwords and the generation of targeted red team activities.

Point to note — APTs and malicious actors use metadata collection in their targeting phases without question, so you should make it harder for them.

Photo Credit — @ffstop Thinking different about metadata.

I understand metadata, so what?

Metadata is used for forensic purposes that commonly result in target identification or the correlation of a user/software with breached data. Unless your workflow/processes need metadata, eliminating metadata only benefits you or your organisation.

You can't remove your current metadata footprint without significant effort, but you can change what you do from now on!

Businesses or individuals who publish data regularly should create a process flow that removes metadata from files before exposing them to the internet. Example: file saved > automated check (does the file have metadata?) > run metadata removal > push file to publish location > publish file.

Example publishing process.
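A minimal sketch of such a gate in bash, assuming hypothetical staging/ and publish/ directories:

#check each staged file for identifying tags and strip them before publishing
for f in staging/*; do
  if exiftool "$f" | grep -qE 'Author|Creator|Producer'; then
    exiftool -all= -overwrite_original "$f"
  fi
  mv "$f" publish/
done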

For singular files or irregular metadata removal, ExifTool has an “easy to use” GUI that enables metadata removal, or a terminal version, shown below.

Exiftool GUI.
Removing metadata in Linux with ExifTool.
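Since the screenshot above shows a terminal session, the equivalent command is a one-liner (file name hypothetical):

#strip all writable metadata tags in place (no backup copy is kept)
exiftool -all= -overwrite_original document.pdf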

If you take numerous screenshots for forensic purposes (like I do), you could use automated processes that eliminate metadata, as seen below.

ShareX Screenshot software, with post screenshot actions that remove metadata.

I also created a PowerShell script that automatically cleans and moves files for automation purposes (reach out if you want it).

Files required to run 😎.
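The script itself isn't reproduced here; a rough bash equivalent of the same idea, watching a hypothetical inbox/ directory with inotify-tools, could look like:

#watch for newly written files, strip their metadata, then move them on
inotifywait -m -e close_write --format '%w%f' inbox/ | while read -r f; do
  exiftool -all= -overwrite_original "$f" && mv "$f" cleaned/
done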

Automation is key!

Automating metadata collection, processing and deduplication should be your primary goal if you are monitoring exposure. I have created automated tooling and processes that achieve this, so reach out if you want to know more.

Sample automation flow
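A trivial scheduling sketch, assuming a hypothetical collection script and paths:

#hypothetical crontab entry: collect nightly, then extract metadata to JSON
0 2 * * * /opt/osint/collect.sh && exiftool -json -r /opt/osint/downloads > /opt/osint/metadata.json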

Metadata Counterintelligence

If you or your business wants to see whether actors are using your metadata against you, plant fake metadata in your files! Simulated data enables you to track and detect malicious actors who use your fake metadata for malicious purposes.

#Example use of metadata counterintelligence
exiftool -overwrite_original -rights="©2021 Jane Doe, no rights reserved" -CopyrightNotice="©2021 Jane Doe, no rights reserved" "Published_policy.pdf"
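To confirm the planted values were written, read the same tags back:

#verify the planted tags on the published file
exiftool -Rights -CopyrightNotice "Published_policy.pdf"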

Thanks for reading!

Most of the examples contained within this post are manual depictions of key points, but automation is the best way forward for analysing data at scale.

I have been too busy over the last 2–3 years to write content on a regular basis, but I hope these small posts can reduce risk to you or your organisation.

All the above examples were created in VMs: Kali, Tracelabs and custom Windows boxes built on Threat-Pursuit.
