Avoid Malware From User-Submitted Content

By Martin Shelton

NB Martin wrote an update to some of the information in this post on Source.

When journalists ask readers to share newsworthy images, videos, audio, or documents, they increase their own vulnerability to attack. And if editors publish those files unaltered, they risk infecting their readers as well.

This document outlines the malware concerns and mitigation techniques that The Coral Project and others should consider in order to reduce this risk.

What is malware, and why should it matter to news orgs?

Malware is software designed to allow a third party to obtain unauthorized access or otherwise make use of a user’s system. ‘Malware’ is a catch-all term for all kinds of malicious software. This could include ransomware that encrypts users’ hard drives until they pay to have them unlocked; scripts that give a remote user administrative privileges; or Trojans that use victims’ processing power and bandwidth as part of a massive network of bots. For news organizations, serving files from third parties comes with the risk of spreading malware, and this has happened to news organizations in the past.

In early 2016, multiple news sites including the New York Times and BBC inadvertently served users advertisements laced with ransomware that had slipped through third-party ad networks. Similarly, in January 2016 (after prompting users to turn off adblockers) Forbes also unknowingly infected its readers with malware. Malware researchers find hundreds of such examples each month. In fact, as I write this, malicious ads have just been discovered at Newsweek. It’s a very common occurrence.

Why should readers visit a site if they aren’t confident in its safety? At a time when trust in journalism is reported to be at an all-time low, this problem seems particularly relevant for news organizations.

How did I investigate this problem?

I gathered information from several sources on how news organizations can safely serve media files. I first investigated how several comment systems and forums (e.g., Disqus, Kinja, phpbb, Discourse) serve media files. I had countless conversations with security specialists at hacker conferences (e.g., HOPE, DEFCON), on social media channels (e.g., on Twitter) and with the security community at tinfoil.press to learn about potential malware issues when serving media files. Finally, I interviewed 10 security specialists about how to prioritize malware threats. During these conversations, I explained the use cases behind our Ask and Talk tools. Before long, the specialists converged on many of the same solutions.

Let’s first discuss the specific problems that newsrooms and end users are likely to confront with malware.

What are the key concerns?

The rules of the malware game are pretty simple. The attacker needs a user to give permission to execute the malware, either by taking advantage of open permissions in another browser page, or by getting the user to execute the malware themselves.

Sometimes a file with malware is automatically downloaded and opened. Typically this involves getting users to click a link or visit a webpage that will redirect to a website that downloads malware (e.g., with malicious JavaScript or Flash). In a narrow range of instances, it will open automatically.

More commonly, malware files must be manually executed. Common approaches for getting users to open malicious files include:

Using a web page that automatically downloads a malicious file, then prompts users to open it.
Sending a custom phishing email to convince the user that they should open the attached file, typically by impersonating a trustworthy source. For example, one common attack is to send a work-related email that convinces a journalist that they may be interested in the material contained in an ordinary-looking .docx file or .pdf that in fact contains malware.
Getting users to download and open a file that is masquerading as a file type besides an executable. For Windows users, due to the operating system’s naming conventions, kitties.jpg.exe will typically display as kitties.jpg. When the user executes the file, it will display the kitties, while also infecting their machine.
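
The double-extension trick in the last item can be screened for before a file ever reaches a journalist. What follows is a minimal sketch, not a complete defense: the function name and the set of extensions are assumptions made for illustration, and a real deployment would tune the list for its own platform.

```python
from pathlib import PurePath

# Hypothetical list of extensions treated as executable; extend it as needed.
EXECUTABLE_EXTENSIONS = {".exe", ".scr", ".bat", ".cmd", ".com", ".vbs", ".dmg"}


def looks_disguised(filename: str) -> bool:
    """Return True when an executable extension hides behind another extension."""
    suffixes = [suffix.lower() for suffix in PurePath(filename).suffixes]
    # "kitties.jpg.exe" has the suffixes [".jpg", ".exe"]: two or more
    # suffixes, with an executable one last, is the pattern described above.
    return len(suffixes) >= 2 and suffixes[-1] in EXECUTABLE_EXTENSIONS


if __name__ == "__main__":
    for name in ("kitties.jpg", "kitties.jpg.exe"):
        print(name, looks_disguised(name))
```

A check like this only looks at the name, so it can flag harmless files that happen to contain extra dots, and it says nothing about what the file actually contains.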

Key file formats we should be most concerned about

Far and away, the two families of files we should be most concerned about are PDFs and Microsoft Office documents. While PDFs include a JavaScript launcher that can download malware apps, Office documents (e.g., .doc, .docx, .xlsx, .pptx) can launch macros that execute bad code. According to many of the security specialists with whom I spoke, as well as my own independent investigation, it would be very surprising if PDFs and Office documents were not the most common files embedded with malware in phishing attacks.

Other types of file formats can also cause real problems, for example:

.svg files can contain JavaScript entries that can open new web pages, allowing the browser to download files, typically an executable file that can contain malware. Even in those cases, the user usually still needs to execute the file manually.
Flash, a browser plugin widely used for web video and animation, is riddled with security holes that make it a convenient vector for attacks.

Countless formats can distribute malware. Fortunately, relatively few file formats are relevant to our use case.

How do we defend against these threats? I’ll briefly describe an overview of the strategy for defending against malware, and then describe more specific approaches for each one of the maneuvers we can take.

Overview of the anti-malware strategy

Because one of the key problems with malware is the possibility of a malicious file masquerading as another file, we first need to confirm the file type.

Security specialists typically recommended reencoding images and documents, which breaks malware hidden within the file. Files should be reencoded in a disposable virtual machine – an isolated environment dedicated to reencoding, which has no file access to the rest of the system. The virtual machine is deleted immediately after it is done reencoding. Potentially useful file metadata should be exported prior to reencoding.

Finally, introducing anti-spam controls would diminish the number of links shared by users to sites containing malware.

For reasons I will explain below, using malware scanning tools and reencoding videos and audio may be secondary goals.

What you can do

CONFIRMING FILETYPES: BLACKLISTING, WHITELISTING

Before giving journalists or readers access to any files, we need to be able to identify file types, then blacklist and whitelist different formats.

The first few bytes of a file usually indicate the file type in hex signatures. These are sometimes called magic numbers. (See some examples of magic numbers here.)
A unix terminal tool – ‘file’ – can identify file types relatively easily. (e.g., ‘file path/to/your_document.docx’).
You can also check filetypes with libmagic (see https://github.com/ahupp/python-magic).

Do not serve anything that looks like an executable (.dmg or .exe), or a library. Only serve the file formats you are willing to accept after confirming their format.

If the file format doesn’t check out, throw it in your blacklist. If it does check out, we may still need to look closer at the file before serving it to users.
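
To make the magic-number check concrete, here is a minimal sketch that uses only the Python standard library. The signature table holds a handful of well-known values and is far from exhaustive; the whitelist and the function names are assumptions for illustration. In practice, libmagic (via python-magic) recognizes many more formats.

```python
# A few well-known file signatures ("magic numbers"); not an exhaustive list.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"MZ": "windows-executable",
}

# Hypothetical whitelist: the only formats this site is willing to accept.
ALLOWED_TYPES = {"jpeg", "png", "gif"}


def sniff_type(data: bytes) -> str:
    """Guess the file type from its leading bytes, ignoring the file name."""
    for signature, file_type in SIGNATURES.items():
        if data.startswith(signature):
            return file_type
    return "unknown"


def is_accepted(data: bytes) -> bool:
    """Accept an upload only if its detected type is on the whitelist."""
    return sniff_type(data) in ALLOWED_TYPES


if __name__ == "__main__":
    print(sniff_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
    print(is_accepted(b"MZ\x90\x00"))
```

Because the decision is based on the file's contents rather than its name, a renamed executable is rejected even if it arrives as kitties.jpg. Anything the table does not recognize is treated as unknown and rejected as well.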

USING VIRTUAL MACHINES (VMs) TO REENCODE FILES

Malware can be designed to target the host system, as well as other users. The best way to protect the host system is to use virtual machines when dealing with media files. Consider a well-supported virtualization solution (e.g., Xen) to create disposable machines before reencoding files.

When we accept files from users, we should reencode them inside of a disposable VM, and pass only the reencoded file to a fresh VM before pushing it to the front-end. We would kill both virtual machines after we have stored our protected file.

Find more details on setting up Xen for your development environment here.

RENDERING DOCUMENTS

Because PDFs and Microsoft Office files are among the most commonly used for distributing malware, it is important that we do not serve original PDFs or Office documents to users. We can instead serve documents as static images or re-rendered within document readers such as DocumentCloud.

Within a virtual machine, you can pull content out of documents or simply convert PDFs and office documents to static image files.

A free and open source software suite, ImageMagick, can convert PDFs directly into static images (e.g., ‘convert your_document.pdf yourfile.png’). You can get ImageMagick here, and find more documentation about specific conversions here.
ImageMagick will not convert Office documents directly into static images. Many types of Office files (.docx, .pptx, etc.) can be converted to PDFs (e.g., with unoconv) and then converted to static images. You can download unoconv and view sample conversions here.
You can also export content from documents (e.g., with Apache Tika) within a virtual machine before serving it to users. See some examples and get started here.
DocumentCloud^[a] and Google Docs also allow you to render documents without launching them. I recommend investigating their security documentation if you want to go that route.
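
The two-step conversion described above (Office document to PDF, then PDF to static image) could be scripted roughly as follows. This is a sketch under assumptions: it presumes that unoconv and ImageMagick's convert are installed inside the disposable VM, and the function names and the timeout value are invented for illustration.

```python
import subprocess
from pathlib import Path


def office_to_pdf_command(source: Path) -> list[str]:
    """Build the unoconv call that writes a PDF next to the source file."""
    return ["unoconv", "-f", "pdf", str(source)]


def pdf_to_png_command(source: Path, target: Path) -> list[str]:
    """Build the ImageMagick call quoted above (convert input.pdf output.png)."""
    return ["convert", str(source), str(target)]


def render_to_png(source: Path, target: Path, timeout_seconds: int = 60) -> None:
    """Convert an Office document or PDF to a static image; raise if a tool fails."""
    pdf = source
    if source.suffix.lower() != ".pdf":
        subprocess.run(office_to_pdf_command(source), check=True, timeout=timeout_seconds)
        pdf = source.with_suffix(".pdf")
    subprocess.run(pdf_to_png_command(pdf, target), check=True, timeout=timeout_seconds)


if __name__ == "__main__":
    print(office_to_pdf_command(Path("your_document.docx")))
    print(pdf_to_png_command(Path("your_document.pdf"), Path("yourfile.png")))
```

Passing the arguments as a list, rather than as one shell string, keeps a hostile file name from being interpreted by a shell. The timeout is there so that a document crafted to stall the converter cannot tie up the VM indefinitely.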

REENCODING IMAGES

You can also reencode images with ImageMagick. Most of the specialists recommended straining potentially malicious binaries by simply reencoding the image. This can be accomplished by converting it to another format.

BEING SMART ABOUT METADATA

It’s important to note that reencoding images often removes EXIF metadata with details about how and where the image was taken. The image above for example includes metadata about the longitude and latitude where the photo was taken, the kind of camera used to take it, the date it was taken, among other details. Those details can be viewed here. This kind of information can be vital for journalists trying to verify the legitimacy of an image.
Depending on how an image is reencoded, the resultant image may or may not contain the relevant metadata. Typically ImageMagick will lose certain EXIF metadata when an image is converted from one format to another (e.g., .jpg -> .png), but can usually retain metadata if the image is reencoded as the same format (.jpg -> .jpg). The composition of the file (its binary) will be different, but the metadata will be the same.
Therefore, a system built for journalistic use should consider an option to export/retain metadata before reencoding the image, serving that data alongside the final reencoded image.
Metadata can also be exported from documents within a virtual machine before serving it to users. You can extract metadata from several formats with Apache Tika.

ANTI-SPAM MEASURES

Having anti-spam controls helps to diminish the potential for spreading malware through links to external sites.

A good place to start is through using a verification system such as reCAPTCHA. It won’t stop people from posting links to sites with malware, but it will make it impractical to post mass malware links via bots. See the reCAPTCHA developer guide here.

Popular browsers already do some work to prevent users from opening malicious links. Google Chrome, Mozilla Firefox, and Apple’s Safari use Google’s Safe Browsing blacklist to warn users about web pages that contain signs of malware or phishing.

Other concerns

Audio

The experts I spoke to did not identify any glaring vulnerabilities associated with commonly used audio files. However, it’s trivially easy to guard against this threat. We could address the remote potential of problems with audio files by reencoding them in the same manner as described above. For example, open source audio software such as FFmpeg could be used to convert an .mp3 to a .wav or Ogg Vorbis file before serving it to the journalist or listener.

Video

The security specialists I spoke to did not identify significant video exploits, and concluded that videos are unlikely to launch malware on their own. The bigger problem related to video is that unpatched media players or video encoders may contain exploitable vulnerabilities, and while that’s a very serious problem, it falls outside of the domain of The Coral Project. In the remote case that videos do raise malware concerns, video players that reencode files can be helpful.

Virus scanning databases

Using virus databases (e.g., the VirusTotal API) would also give a system a slight edge, but could be a costly maneuver in terms of CPU resources, particularly for users of Coral tools. Virus scanners are also not foolproof against savvy attackers.

Virus scanners generally work by comparing a hash of a file to a known infected file. The hash is simply a long string of letters and numbers that corresponds to the file binary (e.g., 0491f4e55158d745fd1653950c89fcc9b37d3c1102680bd3ce67616a36bb2592 – this example is a hash for a malicious file. You can look it up in the VirusTotal database by searching for the hash, which will produce an analysis of the file.)

The problem with this approach is that changing only a small chunk of the binary may not break the file, yet will produce a different hash, allowing a savvy attacker to bypass virus scans. In other words, when it comes to virus scanners, your mileage may vary.
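
The weakness is easy to demonstrate. The sketch below assumes SHA-256, which matches the 64-character length of the example hash above; the byte strings are harmless stand-ins, not real malware.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


original = b"stand-in for the bytes of a known malicious file"
tweaked = original + b"\x00"  # one extra byte, as an attacker might append

if __name__ == "__main__":
    # The two digests differ, so a lookup keyed on the first hash
    # would not match the tweaked file.
    print(sha256_hex(original))
    print(sha256_hex(tweaked))
```

A single appended byte yields a completely different digest, which is why a hash lookup alone cannot be the only line of defense.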

It’s important to point out that there are potential privacy implications for submitting file hashes to a publicly accessible virus database. It’s possible that an attacker could identify a file based on its hash. That would likely require the attacker to have access to the file in the first place. There are few instances where this would be an issue for us, but it’s something to keep in mind.

ALSO WORTH NOTING

ImageMagick is a robust, free and open source suite for creating and manipulating images. ImageMagick came up repeatedly during interviews for solving reencoding problems of all kinds, but it’s not perfect.

ImageMagick has suffered occasional exploits. For example, one exploit (see ImageTragick) gives an attacker remote code execution. Checking the magic numbers for your files will help prevent unwanted malicious files from being processed in the first place. Also, you should always keep your version updated.
Specific advice from Cooper Quintin of the Electronic Frontier Foundation: Compile ImageMagick with flags that enable Address Space Layout Randomization (ASLR), which makes it harder to exploit a buffer overflow in ImageMagick.

Other approaches

Disqus

Engineers on the comment platform Disqus have set Content Security Policy (CSP) headers inside embeddables, which they say prevents JavaScript from loading. They also state that they reencode and whitelist specific image formats.
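
Disqus did not share its exact policy with me, so the following is only an illustration of the general idea: a Content Security Policy header that forbids scripts on a page that embeds user-submitted media. The function name and the specific directives are assumptions, not Disqus’s configuration.

```python
def embed_security_headers() -> dict[str, str]:
    """Headers for a page that embeds user-submitted media (illustrative policy only)."""
    policy = "; ".join([
        "default-src 'none'",  # block everything not explicitly allowed below
        "img-src 'self'",      # images may load only from our own origin
        "script-src 'none'",   # no JavaScript, inline or external
    ])
    return {"Content-Security-Policy": policy}


if __name__ == "__main__":
    print(embed_security_headers())
```

The returned dictionary would be merged into the response headers by whatever web framework serves the embed.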

Discourse

The open-source forum platform Discourse compresses uploads, and then serves a different version to community members.

When users want to copy and paste a document from inside a .docx file, the platform transforms the document into a static image. This helps to preserve formatting in a document and share it, without ever downloading or converting a .docx.

Qubes OS

Qubes uses the RGB values of a PDF’s pixels in a virtual machine to recreate the file.
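
The appeal of this approach is that only raw pixel values ever leave the virtual machine, and raw pixels are simple enough to validate completely. The sketch below is not Qubes code. It illustrates the receiving side under an assumed, hypothetical interface: the untrusted VM hands over a width, a height, and a flat run of RGB bytes, and the trusted side rebuilds a plain PPM image from them.

```python
def rebuild_ppm(width: int, height: int, rgb: bytes) -> bytes:
    """Rebuild an image from raw RGB values, trusting nothing but the pixel data."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    # Three bytes (red, green, blue) per pixel; anything else is rejected.
    if len(rgb) != width * height * 3:
        raise ValueError("pixel data does not match the declared dimensions")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + rgb


if __name__ == "__main__":
    one_red_pixel = rebuild_ppm(1, 1, b"\xff\x00\x00")
    print(one_red_pixel)
```

Nothing from the original file's structure survives the trip, only colors, so scripts or macros embedded in the source document have nowhere to hide in the rebuilt image.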

Conclusion

Keeping readers and journalists safe from malicious attacks is a challenge. The malware mitigation techniques described here are no substitute for standard practices, such as keeping all system software updated. But by taking basic steps, platforms and news organizations can help make digital attacks much more difficult.

Acknowledgements

Special thanks for guidance from…

Runa Sandvik, Director of Security, Newsroom, at the New York Times
Mike Tigas, Hacker-Journalist at ProPublica
Harlo Holmes, Security Trainer at the Freedom of the Press Foundation
Jason Hernandez, Reporter at North Star Post
Security friends at Disqus (Brian Falldin, Burak Yiğit Kaya, Jason Yan)
Ramana Rao, Head, Livefyre Engineering
Cooper Quintin, Security Researcher at the Electronic Frontier Foundation
Tom Lowenthal, Staff Technologist at the Committee to Protect Journalists
Micah Lee, Technologist and Journalist at The Intercept

[a] Related note from Mike Tigas at ProPublica: DocumentCloud re-renders uploaded documents in their own viewer. If you want to plug into DocumentCloud, it might be smart to talk to Ted Han first, but you could investigate their backend code to see how they handle it.

Cover photo by Christiaan Colen [CC BY-SA 2.0]

Martin Shelton was a Knight-Mozilla OpenNews Fellow with The Coral Project. He earned his PhD in Information and Computer Science at UC Irvine, specializing in journalism and computer security. He now works at Google.