Language-Theoretic Security

Language-theoretic security, or LangSec, is an approach to software security that focuses on input handling, complexity, and program design as strategies to improve the verifiability of computer programs. It was introduced in 2011 by Len Sassaman and Meredith L. Patterson.[1] It aims to create a formal description of which software is likely to have security vulnerabilities of particular classes, and why. It considers every program to have an inherent parser component, whether explicit or not, consisting of the part of the program that operates on external input before that input is fully parsed. A central hypothesis of language-theoretic security is that vulnerabilities in software increase with the computational power of the notional input-accepting automaton equivalent to this parser, using the definitions of automata theory. The lower bound on this computational power is the complexity of the program's input language. The extent to which this complexity can be reduced is a function of the specification of the communication protocol or file format the program takes as input.

Parsing as a Security Mechanism

The behaviour of a program is defined with reference to its expected input. The use of unexpected input by a program is a factor in numerous security bugs, including the so-called Android master key vulnerability (CVE-2013-4787),[2] because accepting unexpected input renders the program's specification ambiguous. In that instance, the unexpected ambiguity came in the form of a ZIP file with duplicate filenames. If a program fully parses its input and acts only on input that unambiguously meets the specification, it follows that the program will avoid these types of vulnerabilities. This is an intentional inversion of the Postel principle of being liberal in what one accepts. Accepting only unambiguous and valid input is a more formal requirement than input validation or sanitization, and it reduces the number of possible but unanticipated program states that can be induced in an application via user input. Conversely, failure to do this is associated with security vulnerabilities.[3] Input sanitization in particular is held to be an inadequate approach to avoiding malicious input because it inherently ignores context-sensitive properties of the input;[4] it can therefore have paradoxical effects, such as sanitization code activating otherwise inert cross-site scripting payloads in browsers.[5]

Parser Differentials

If the language of accepted program input is sufficiently simple, it is possible to verify that two implementations parse the same input language consistently. This is advantageous because it shows that no parser differential exists between the two implementations. The requisite level of simplicity is, in theory, that for which the language equivalence problem is decidable. If the two parsers involved in CVE-2013-4787 had been equivalent (that is, if they produced the same output state given the same input state), the vulnerability could not have existed. One strategy for establishing such consistency is to publish machine-readable specifications of a format or protocol and then use a parser generator to generate the parser code. An example of a parser generator built for this purpose is DaeDaLus.[6] Combining Lex with GNU Bison or Yacc, or using ANTLR, also accomplishes this. However, many parser generators allow the mixing of general-purpose code with the parsing definitions, which weakens the guarantees provided by parsing.[7]
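As an illustrative sketch, not drawn from the cited sources, agreement between two parser implementations can be checked by differential testing: both are run on the same inputs, and any disagreement indicates a parser differential. The toy input language (unsigned decimal integers) and the function names below are hypothetical.

```rust
// Hypothetical differential test between two independent parsers for the
// same toy input language: strings of ASCII digits denoting a u64.

/// First implementation: accumulate digit by digit, rejecting overflow.
fn parse_a(input: &str) -> Option<u64> {
    if input.is_empty() {
        return None;
    }
    let mut value: u64 = 0;
    for b in input.bytes() {
        if !b.is_ascii_digit() {
            return None; // reject anything outside the specified language
        }
        value = value.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
    }
    Some(value)
}

/// Second implementation: delegate to the standard library.
fn parse_b(input: &str) -> Option<u64> {
    input.parse::<u64>().ok()
}

fn main() {
    // Exercise both parsers on valid, malformed, and overflowing inputs.
    for input in ["0", "42", "007", "", "9a", " 1", "18446744073709551616"] {
        assert_eq!(parse_a(input), parse_b(input), "differential on {input:?}");
    }
    // Note that the input "+1" would expose a real differential here: the
    // standard-library parser accepts an optional leading '+' sign, while
    // parse_a does not, so the two do not accept the same language.
    println!("no differential found on the tested inputs");
}
```

In practice, fuzzing or property-based testing would generate the test inputs rather than enumerating them by hand.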
Analysis of Injection Attacks

Injection attacks are generally the result of differences between the serializer (or "unparser") and the corresponding parser at a layer boundary in a system; they are therefore a special case of parser differentials.[8] In a SQL injection attack, for example, an attacker is able to cause the application with which they are interacting to serialize a SQL query that has different semantics than intended. In the simplest case, where the payload ends a string and adds new code, the payload has crossed the code-data boundary in SQL. In language-theoretic security, this is treated as a bug in the serializer of the SQL query, which should instead be written in a way that constrains its possible outputs to those within the scope of the intended query.

Parser Combinators

If a parser generator is not used, it is still possible to avoid implementation bugs by using a parser combinator library such as Nom[9] to implement the parser code. This has the drawback of relying on a programmer to correctly translate the specification into the language of the parser combinator library, though this task is still less error-prone than hand-coding a parser.[10]
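As a minimal sketch, assuming version 7 of Nom's combinator interface, the following parser fully recognizes a hypothetical key-value input language (strings of the form key=value;) and rejects everything else. The grammar is illustrative and not taken from the cited sources.

```rust
use nom::{
    bytes::complete::tag,
    character::complete::{alpha1, alphanumeric1},
    combinator::all_consuming,
    sequence::{separated_pair, terminated},
    IResult,
};

/// Parsed form of the toy input language `key=value;`.
#[derive(Debug, PartialEq)]
struct Pair<'a> {
    key: &'a str,
    value: &'a str,
}

/// Grammar: key is one or more letters, value is one or more
/// alphanumerics, and the message is terminated by ';'.
fn pair(input: &str) -> IResult<&str, Pair<'_>> {
    let (rest, (key, value)) =
        terminated(separated_pair(alpha1, tag("="), alphanumeric1), tag(";"))(input)?;
    Ok((rest, Pair { key, value }))
}

fn main() {
    // all_consuming enforces full recognition: trailing bytes are an
    // error rather than silently ignored input.
    let mut parser = all_consuming(pair);
    assert_eq!(
        parser("mode=strict7;"),
        Ok(("", Pair { key: "mode", value: "strict7" }))
    );
    assert!(parser("mode=strict7;extra").is_err());
    assert!(parser("mode=;").is_err());
}
```

Because each combinator corresponds to a production of the grammar, the correspondence between the specification and the code can be checked clause by clause.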
Input Format Complexity

Complexity in computer programs is associated with security vulnerabilities.[11][12][13] Within the domain of language-theoretic security, complexity is described with reference to the computational power of the abstract machine necessary to implement the program, or more particularly, to implement the parser for its input language. This complexity determines whether it is possible to show that there is no unintended or undesired functionality in the program that might be exploitable by an attacker. For this complexity to be bounded, the program's input must be well-defined in terms of both form and semantics.[14]

Weird Machines

A weird machine is a model of computation in a program that exists in parallel with, but is distinct from, the intended abstract model of computation in that program. Some classes of weird machine arise from the multi-layered nature of computer programs, or from the context in which the programs run; others result from the unanticipated functionality a program has due to its complexity or to software bugs. The more complex the computation model of a program, the more likely it is to implement a weird machine. Depending on context, the weird machine may or may not be concretely useful to an attacker. Since the space of weird machines in the context of some program is the universe of all possible states that are not among the program's intended states, many exploited states, including remote code execution[15] and injection attacks, belong to the domain of weird machines. A reduction in weird machines is therefore a likely correlate of reduced program vulnerability.

SafeDocs Project

SafeDocs is a DARPA project undertaken in 2018 to take existing file formats, create safer subsets of them, and develop programming tools that work with the safer formats. The initial test case for this was PDF. The purpose of creating safer subsets in this case is to reduce the lower bound on parser complexity so that it becomes possible to create tools that will generate correct, normative parsers for them.[16]

Relation to Programming Languages

The analytic framework of language-theoretic security treats programs as virtual machines that execute their input. A document read by an application is in this sense a form of machine code, in a generalization of the data-as-code idea, following the automata-theoretic description of parsers.

Type-Safe Programming Languages

Parsing input and serializing output are operations that consume one data type and emit another. A programming language can therefore check that data is correctly parsed and contains the expected structure by checking data types, and correct serialization (or unparsing) can be implemented as operations on the data types that are relevant to the program's output. This approach can be used to show that the recognizer and unparser patterns have been implemented. It is also possible to implement type checking across a distributed system to enforce parsing and unparsing of the expected structures and to verify that the assumptions made in designing the compositional properties of a distributed system have been followed.[17]

Memory-Safe Programming Languages

In the general case, spatial memory correctness is undecidable. If any proof of spatial memory correctness is to be made, it is therefore necessary to bound the complexity of the code. Languages such as Java and Python effectively accomplish this via runtime bounds checking, and frameworks for runtime bounds checking also exist for C.[18] The effect of these strategies is to create a halt state in place of a spatial memory correctness violation; it can therefore be shown that the program will not violate spatial memory correctness, but in exchange, it cannot be shown in the general case that the program will not raise runtime bounds-checking exceptions. Some programming languages, such as Rust, accomplish this using borrow checking. The borrow checker assures spatial memory correctness by enforcing ownership and reference rules at compile time. Code whose spatial memory correctness cannot be shown therefore does not compile, inherently limiting the program's memory behaviour to what is decidable. This works to the extent that the borrow checker is correct, which has been shown for subsets of Rust but not for the totality of Rust; therefore, while the language is nominally "memory-safe", bugs that violate the temporal correctness of the memory model have been discovered.[19]

Program Analysis

When the complexity of program input is limited such that it requires only a sub-Turing program to validate, the parser is more amenable to verification and static analysis. The programming language Crema, as an example, experimentally demonstrated that an SMTP protocol parser written in even a minimally sub-Turing programming language allowed the symbolic execution engine KLEE to achieve more coverage while exploring fewer program states than the parser used in Qmail.[20]

Programming Patterns

Recognizer Pattern

The recognizer pattern is a software design pattern described by the language-theoretic security community whereby a program's input is fully recognized by a parser before the program acts upon that input.[21] As a result, the post-parsing program logic can be expressed in terms of only valid program input, and the number and complexity of cases that must be handled when checking the program for correctness can therefore be bounded and reduced. Bounding the complexity of the recognizer's output can likewise be used as a strategy to render the program logic amenable to verification.
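The following is an illustrative sketch of the recognizer pattern, with hypothetical names and a toy "GET token" input language: the ValidMessage type can only be constructed by the recognizer, so the program logic is unreachable for unrecognized input.

```rust
// Illustrative sketch of the recognizer pattern (names are hypothetical).
// Raw input is fully recognized into a dedicated type before any program
// logic runs; downstream code can only ever see valid input.

/// A validated message: constructible only via `recognize`.
struct ValidMessage {
    command: String,
}

/// Full recognition up front: accept only `GET <token>` where the token
/// is non-empty ASCII-alphanumeric, and reject everything else.
fn recognize(raw: &str) -> Result<ValidMessage, &'static str> {
    let mut parts = raw.split(' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("GET"), Some(token), None)
            if !token.is_empty() && token.bytes().all(|b| b.is_ascii_alphanumeric()) =>
        {
            Ok(ValidMessage { command: token.to_string() })
        }
        _ => Err("input rejected: not in the accepted language"),
    }
}

/// Program logic is written only in terms of the validated type.
fn handle(msg: &ValidMessage) {
    println!("handling {}", msg.command);
}

fn main() {
    match recognize("GET status42") {
        Ok(msg) => handle(&msg),
        Err(e) => eprintln!("{e}"),
    }
    // Path traversal characters are outside the accepted language,
    // so this input never reaches `handle`.
    assert!(recognize("GET ../etc/passwd").is_err());
}
```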
Unparser Pattern

The unparser pattern is the inverse of the recognizer and is designed to constrain the program's output so that it cannot be used to exploit consumers of its data. It is a more rigorous approach than output encoding: formal grammars are used to construct an output serializer that can only generate valid messages. Consumers of the program's output may therefore rely on the stated definition of the output data regardless of the program's input, making it possible to reason about the compositional properties of the whole system.
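As an illustrative counterpart to the recognizer sketch above, again with hypothetical names, the following unparser can emit only messages of the toy key=value; grammar, because its field type rejects, at construction time, any characters with structural meaning in that grammar.

```rust
// Illustrative sketch of the unparser pattern (names are hypothetical).
// Output is serialized only from a constrained type, so every emitted
// message is valid by construction, regardless of the program's input.

/// A field guaranteed, by construction, to contain no characters with
/// structural meaning in the output format.
struct SafeField(String);

impl SafeField {
    fn new(s: &str) -> Result<SafeField, &'static str> {
        if !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric()) {
            Ok(SafeField(s.to_string()))
        } else {
            Err("field rejected: would not serialize to a valid message")
        }
    }
}

/// The only way to produce output; consumers can rely on its shape.
fn unparse(key: &SafeField, value: &SafeField) -> String {
    format!("{}={};", key.0, value.0)
}

fn main() {
    let key = SafeField::new("mode").unwrap();
    let value = SafeField::new("strict7").unwrap();
    // Output is always `key=value;`, so a downstream parser of that
    // grammar cannot receive injected structure such as an extra ';'.
    assert_eq!(unparse(&key, &value), "mode=strict7;");
    assert!(SafeField::new("strict;DROP").is_err());
}
```

Because validity is enforced by the field type, every call site of the serializer is covered at once, rather than each output path needing its own encoding step.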
References