IEEE Software Blog: July 2018

By: Cristina Cifuentes, Oracle Labs (@criscifuentes)
Associate Editor: Karim Ali, University of Alberta (@karimhamdanali)
Back in 1995, only 25 Common Vulnerabilities and Exposures (CVE) were reported. By 2017, that number had grown to more than 10,000, as reported in the National Vulnerability Database (NVD). As the CVE list is a dictionary of publicly disclosed vulnerabilities and exposures, it should give an idea of the current situation.
However, it does not include all data, because some vulnerabilities are never disclosed. For example, they may be found internally and fixed, or they may be in cloud software that is actively upgraded through continuous integration and continuous delivery (CI/CD). Some ecosystems also keep their own records; for example, Node.js has its own Node Security Platform Advisories system.
A five-year analysis (2013 to 2017) of the labelled data in NVD reveals that three of the top four most common vulnerabilities are issues that can be taken into account in programming language design:

5,899 buffer errors
5,851 injection errors¹
3,106 information leak errors

Combined, these three types of vulnerabilities represent 53% of all labelled exploited vulnerabilities listed in the NVD for that period, and they affect today’s mainstream languages.
We have known about exploitation of buffer errors since the Morris worm exploited a buffer error in the Unix finger server over 25 years ago. SQL injections² and XSS exploits³ have been documented in the literature since 1998 and 2000, respectively. A 2016 study⁴ revealed that the average cost of a data breach is $4 million for an average of 10,000 records stolen through a vulnerability exploit. Financial loss aside, imagine all the innovation that could happen if more than 50% of vulnerabilities did not exist!

Not only is the percentage of issues high, data on mainstream programming languages shows that none of them provide solutions to these three areas at the same time. A few languages provide solutions to buffer errors and/or injection errors, and no language provides a solution to information leaks. In other words, there are no mainstream secure languages that prevent these issues.

Why is this happening?

Let’s be clear about one thing – developers do not write incorrect code because they want to. It happens inadvertently, because our programming languages do not provide the right abstractions to support developers in writing error-free code.

Abstractions in programming languages introduce different levels of cognitive load. The easier it is for an abstraction to be understood, the more accepted that abstraction becomes. For example, managed memory is an abstraction that frees the developer from having to manually keep track of memory (both allocation and deallocation). As such, this abstraction is widely used in a variety of today’s programming languages, such as Java, JavaScript, and Python.
At the same time, performance of the developed code is also of interest. If an abstraction introduces a high performance overhead, it makes it hard for that abstraction to be used in practice in some contexts. For example, managed memory is not often used in embedded systems or systems programming due to its performance overhead.
At the root of our problem is the fact that many of our mainstream languages provide unsafe abstractions to developers; namely, manual management of pointers, manual string concatenation and sanitization, and manual tracking of sensitive data. These three abstractions are easy to use and incur low performance overhead, but it is not easy to write correct code with them. Hence, using them correctly places a high cognitive load on developers.
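To make one of these unsafe abstractions concrete, here is a minimal sketch of manual string concatenation going wrong. The helper build_query and the users table are hypothetical, chosen only for illustration; no real database is involved.

```rust
// Hypothetical helper: builds a SQL query by manual string concatenation.
// The table and column names are illustrative only.
fn build_query(user_name: &str) -> String {
    format!("SELECT * FROM users WHERE name = '{}'", user_name)
}

fn main() {
    // Intended use: the input is treated as plain data.
    println!("{}", build_query("alice"));

    // Attacker-controlled input closes the quote and changes the query's logic.
    println!("{}", build_query("' OR '1'='1"));
}
```

The concatenation itself is trivial to write, which is exactly the problem: nothing in the language reminds the developer that the input needed sanitizing first.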
We need to provide safe abstractions to developers; ideally, abstractions that have low cognitive load and low performance overhead. I will briefly review three different abstractions, one for each of the vulnerabilities of interest, to show abstractions that could provide a solution in these areas.

1. Avoiding buffer errors through ownership and borrowing

Rust is a systems programming language that runs fast, prevents memory corruption, and guarantees memory and thread safety. Not only does it prevent buffer errors, it also prevents various other kinds of memory errors, such as null pointer dereferences and use after free. This feature is provided by the introduction of ownership and borrowing into the type system of the language:

Ownership is an abstraction used in C++ whereby a resource can have only one owner. Ownership with resource acquisition is initialization (RAII) ensures that whenever an object goes out of scope, its destructor is called and its owned resource is freed. Ownership of a resource is transferred (i.e., moved) through assignments or passing arguments by value. When a resource is moved, the previous owner can no longer access it, therefore preventing dangling pointers.
Borrowing is an abstraction that allows a reference to a resource to be made available in a secure way: either through a shared borrow (&T), where the reference cannot be used to mutate the resource, or through a mutable borrow (&mut T), where the reference cannot be aliased. The two kinds of borrow cannot be active on the same resource at the same time. Borrowing allows for data to be used elsewhere in the program without giving up ownership. It prevents use after free and data races.
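The two abstractions above can be sketched in a few lines of Rust. The function names (consume, sum, zero_first) are made up for this illustration; the commented-out line at the end shows the kind of access the compiler rejects.

```rust
// Takes ownership of the buffer; the buffer is freed when `consume` returns.
fn consume(buf: Vec<u8>) -> usize {
    buf.len()
}

// Shared borrow (&T): read-only access, ownership stays with the caller.
fn sum(buf: &[u8]) -> u32 {
    buf.iter().map(|&b| u32::from(b)).sum()
}

// Mutable borrow (&mut T): exclusive access for the duration of the borrow.
fn zero_first(buf: &mut [u8]) {
    if let Some(first) = buf.first_mut() {
        *first = 0;
    }
}

fn main() {
    let mut data = vec![1u8, 2, 3];

    let total = sum(&data); // shared borrow ends when the call returns
    zero_first(&mut data); // so a mutable borrow is allowed here
    println!("sum before = {}, data after = {:?}", total, data);

    let len = consume(data); // ownership moves into `consume`
    println!("consumed {} bytes", len);

    // The next line is rejected at compile time because `data` was moved:
    // println!("{:?}", data);
}
```

Note that the checks happen at compile time, so the safety comes without a garbage collector or other runtime bookkeeping.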

Ownership and borrowing provide abstractions suitable for memory safety, and prevent buffer errors from happening in the code. Anecdotal evidence suggests that these abstractions take some time to learn, pointing to a high cognitive load.

2. Avoiding injection errors through taint tracking

Perl is a rapid prototyping language with over 29 years of development. It runs on over 100 platforms, from portables to mainframes. In 1989, Perl 3 introduced taint mode, which tracks external input values (considered tainted) and performs runtime taint checks. These checks prevent direct or indirect use of a tainted value in any command that invokes a sub-shell, or in any command that modifies files, directories, or processes. Arguments to print and syswrite, symbolic methods and symbolic sub references, and hash keys are exempt. By default, tainted values include all command-line arguments, environment variables, locale information, and the results of some system calls (such as readdir() and readlink()).
Ruby is a dynamic programming language with a focus on simplicity and productivity. It supports multiple programming paradigms, including functional, object-oriented, imperative, and reflective. Ruby extends Perl’s taint mode to provide more flexibility. Four safe levels are available, of which the first two are as per Perl:

0. No safety.
1. Disallows use of tainted data by potentially dangerous operations. This level is the default on Unix systems when running Ruby scripts as setuid.
2. Prohibits loading of program files from globally writable locations.
3. All newly created objects are considered tainted.

In Ruby, each object has a Trusted flag. There are methods to make an object tainted, check whether the object is tainted, or untaint the object (only for levels 0–2). At runtime, Ruby tracks direct data flows through levels 1–3; it does not track indirect/implicit data flows.
The taint tracking abstraction provides a way to prevent some types of injection errors with low cognitive load on developers. Trade-offs in performance overhead need to be made in order to determine how much data can be tracked and what target locations should be tracked, and whether direct and indirect uses can be tracked.
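A minimal sketch of runtime taint tracking follows. It is written in Rust for illustration and is not how Perl or Ruby implement taint mode; the Value type, the allowlist used for untainting, and the run_command sink are all assumptions made for this example.

```rust
/// A string value with a runtime taint flag (hypothetical type for illustration).
#[derive(Debug, Clone)]
struct Value {
    text: String,
    tainted: bool,
}

impl Value {
    /// External input is tainted by default.
    fn from_input(text: &str) -> Self {
        Value { text: text.to_string(), tainted: true }
    }

    /// Literals written by the developer are untainted.
    fn literal(text: &str) -> Self {
        Value { text: text.to_string(), tainted: false }
    }

    /// Direct data flow: the result is tainted if either operand is.
    fn concat(&self, other: &Value) -> Value {
        Value {
            text: format!("{}{}", self.text, other.text),
            tainted: self.tainted || other.tainted,
        }
    }

    /// Untaint only after validation (assumed allowlist: ASCII alphanumerics, '.', '_', '-').
    fn untaint(&self) -> Option<Value> {
        let ok = !self.text.is_empty()
            && self.text.chars().all(|c| c.is_ascii_alphanumeric() || "._-".contains(c));
        if ok {
            Some(Value { text: self.text.clone(), tainted: false })
        } else {
            None
        }
    }
}

/// Stand-in for a dangerous sink such as a sub-shell; it refuses tainted data.
fn run_command(cmd: &Value) -> Result<String, String> {
    if cmd.tainted {
        Err(format!("refusing tainted command: {:?}", cmd.text))
    } else {
        Ok(format!("would run: {}", cmd.text))
    }
}

fn main() {
    let prefix = Value::literal("cat ");
    let evil = Value::from_input("notes.txt; rm -rf /");
    println!("{:?}", run_command(&prefix.concat(&evil))); // Err: taint propagated

    let safe = Value::from_input("notes.txt").untaint().expect("validated");
    println!("{:?}", run_command(&prefix.concat(&safe))); // Ok

    // Implicit flow: the branch depends on tainted data, but the result is a literal,
    // so a tracker that follows only direct flows leaves it untainted.
    let flag = if evil.text.contains(';') { Value::literal("1") } else { Value::literal("0") };
    println!("implicit flow tainted? {}", flag.tainted);
}
```

The last few lines illustrate the trade-off mentioned above: following only direct flows is cheap, but information about the tainted input still reaches flag without being marked.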

3. Avoiding information leaks through faceted values

Jeeves is an experimental academic language for automatically enforcing information flow policies. It is implemented as an embedded DSL in Python. Jeeves makes use of the faceted values abstraction, which is a data type used for sensitive values that stores within it the secret (high-confidentiality) and non-secret (low-confidentiality) values, guarded by a policy, e.g., <s | ns> (p). A developer specifies policies outside the code, and the language runtime enforces the policy by guaranteeing that a secret value may flow to a viewer only if the policies allow the viewer to view secret data.
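The shape of a faceted value <s | ns> (p) can be sketched as follows. This is a simplified illustration in Rust, not Jeeves's actual API or semantics: the Viewer and Faceted types and the only_alice policy are hypothetical, and the sketch resolves a facet on each read rather than propagating facets through computations as a full implementation would.

```rust
/// Hypothetical viewer of the data.
struct Viewer {
    name: String,
}

/// A faceted value <secret | public> (policy): two facets guarded by a policy.
struct Faceted<T> {
    secret: T,
    public: T,
    policy: fn(&Viewer) -> bool,
}

impl<T> Faceted<T> {
    /// The only way to read the value: the policy selects the facet for this viewer.
    fn view(&self, viewer: &Viewer) -> &T {
        if (self.policy)(viewer) {
            &self.secret
        } else {
            &self.public
        }
    }
}

/// The policy is defined separately from the code that uses the value.
fn only_alice(viewer: &Viewer) -> bool {
    viewer.name == "alice"
}

fn main() {
    let address = Faceted {
        secret: "42 Example Street".to_string(),
        public: "undisclosed".to_string(),
        policy: only_alice,
    };

    let alice = Viewer { name: "alice".to_string() };
    let bob = Viewer { name: "bob".to_string() };

    println!("alice sees: {}", address.view(&alice)); // secret facet
    println!("bob sees: {}", address.view(&bob)); // public facet
}
```

Because the facets are private to the type's interface in spirit (all reads go through view), the code that displays the value never has to remember to check who is looking.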
Many applications today make use of a database. To make the language practical, faceted values need to be introduced into the database when dealing with database-backed applications. A faceted record is one that guards a secret and non-secret pair of values. Jacqueline, a web framework developed to support faceted values in databases, automatically reads and writes meta-data in the database to manage relevant faceted records. The developer can use standard SQL databases through the Jacqueline object relational mapping.
The faceted values abstraction provides a way to prevent information leaks with low cognitive load on developers, but at the expense of performance overhead. This ongoing work has yet to determine the lower bound on the performance overhead of tracking both direct and indirect data flows to detect leaks of sensitive data.

The future

The previous abstractions illustrate examples of ways to deal with specific types of errors through a programming language abstraction that may be implemented in the language’s type system and/or tracked in its runtime system. These abstractions provide great first steps at looking into the trade-offs of cognitive load and performance in our programming language abstractions, and to create practical solutions accessible to developers at large.
As a community, we need to step back and think of new abstractions to be developed that avoid high cognitive load on developers – and how to overcome any performance implications. Research in this area would allow for many new secure languages to be developed, languages that prevent, by construction, the existence of buffer errors, injection errors and information leaks in our software; i.e., over 50% of today’s exploited vulnerabilities. We need to improve our compiler technology, to develop new abstractions, and to cross the boundaries between different languages used in today’s cloud applications. The right secure abstraction for a web-based application that is database-backed may be different to the right secure abstraction needed for a microservices application.
With over 18.5 million software developers worldwide, security is not just for expert developers. It’s time to design the future of programming language security – it’s time for secure languages.

¹ Injection errors include cross-site scripting (XSS), SQL injection, code injection, and OS command injection.
² Phrack Magazine, 8(54), article 8, 1998.
³ CERT, “Malicious HTML Tags”, 2000.
⁴ 2016 Ponemon Cost of Data Breach Study.

Monday, July 16, 2018

It’s Time For Secure Languages
