Language-Theoretic Security

Language-theoretic security, or LangSec, is an approach to software security that focuses on input handling, complexity, and program design as strategies to improve the verifiability of computer programs. It was introduced in 2011 by Len Sassaman and Meredith L. Patterson.[1] It aims to create a formal description of which software is likely to have security vulnerabilities of particular classes, and why. It considers every program to have an inherent parser component, whether explicit or not, consisting of the part of the program that operates on external input before that input is fully parsed. A central hypothesis of language-theoretic security is that vulnerabilities in software increase with the computational power of the notional input-accepting automaton equivalent to this parser, using the definitions of automata theory. The lower bound on this computational power is the complexity of the program's input language. The extent to which this complexity can be reduced is a function of the specification of the communication protocol or file format the program takes as input.

Parsing as a Security Mechanism

The behaviour of a program is defined with reference to its expected input. The use of unexpected input by a program is a factor in numerous security bugs, including the so-called Android master key vulnerability (CVE-2013-4787),[2] because accepting unexpected input renders the program's specification ambiguous. In that instance, the unexpected ambiguity came in the form of a ZIP file with duplicate filenames. If a program fully parses its input and acts only on input that unambiguously meets the specification, it follows that the program will avoid these types of vulnerabilities. This is an intentional inversion of the Postel principle of being liberal in what one accepts. Accepting only unambiguous and valid input is a more formal requirement than input validation or sanitization, and it reduces the number of possible but unanticipated program states that can be induced in an application via user input. Conversely, failure to do this is associated with security vulnerabilities.[3] Input sanitization in particular is held to be an inadequate approach to avoiding malicious input because it inherently ignores context-sensitive properties of the input;[4] it can therefore have paradoxical effects, such as sanitization code activating otherwise inert cross-site scripting payloads in browsers.[5]

Parser Differentials

If the language of accepted program input is sufficiently simple, it is possible to verify that two implementations parse the same input language consistently. This is advantageous because it shows that no parser differential exists between the two implementations. The requisite level of simplicity is, in theory, that for which the language equivalence problem is decidable. If the two parsers involved in CVE-2013-4787 had been equivalent (that is, if they produced the same output state given the same input state), the vulnerability could not have existed. One strategy for establishing such consistency is to publish machine-readable specifications of a format or protocol and then use a parser generator to generate the parser code. An example of a parser generator built for this purpose is DaeDaLus.[6] Combining Lex with GNU Bison or Yacc, or using ANTLR, also accomplishes this. However, many parser generators allow the mixing of general-purpose code with the parsing definitions, which weakens the guarantees provided by parsing.[7]
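As an illustrative sketch, not drawn from the cited sources, agreement between two parser implementations can be checked by differential testing: both are run on the same inputs, and any disagreement indicates a parser differential. The toy input language (unsigned decimal integers) and the function names below are hypothetical.

```rust
// Hypothetical differential test between two independent parsers for the
// same toy input language: strings of ASCII digits denoting a u64.

/// First implementation: accumulate digit by digit, rejecting overflow.
fn parse_a(input: &str) -> Option<u64> {
    if input.is_empty() {
        return None;
    }
    let mut value: u64 = 0;
    for b in input.bytes() {
        if !b.is_ascii_digit() {
            return None; // reject anything outside the specified language
        }
        value = value.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
    }
    Some(value)
}

/// Second implementation: delegate to the standard library.
fn parse_b(input: &str) -> Option<u64> {
    input.parse::<u64>().ok()
}

fn main() {
    // Exercise both parsers on valid, malformed, and overflowing inputs.
    for input in ["0", "42", "007", "", "9a", " 1", "18446744073709551616"] {
        assert_eq!(parse_a(input), parse_b(input), "differential on {input:?}");
    }
    // Note that the input "+1" would expose a real differential here: the
    // standard-library parser accepts an optional leading '+' sign, while
    // parse_a does not, so the two do not accept the same language.
    println!("no differential found on the tested inputs");
}
```

In practice, fuzzing or property-based testing would generate the test inputs rather than enumerating them by hand.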
Analysis of Injection Attacks

Injection attacks are generally the result of differences between the serializer (or "unparser") and the corresponding parser at a layer boundary in a system; they are therefore a special case of parser differentials.[8] In a SQL injection attack, for example, an attacker is able to cause the application with which they are interacting to serialize a SQL query that has different semantics than intended. In the simplest case, where the payload ends a string and adds new code, the payload has crossed the code-data boundary in SQL. In language-theoretic security, this is treated as a bug in the serializer of the SQL query, which should instead be written in a way that constrains its possible outputs to those within the scope of the intended query.

Parser Combinators

If a parser generator is not used, it is still possible to avoid implementation bugs by using a parser combinator library such as Nom[9] to implement the parser code. This has the drawback of relying on a programmer to correctly translate the specification into the language of the parser combinator library, though this task is still less error-prone than hand-coding a parser.[10]
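As a minimal sketch, assuming version 7 of Nom's combinator interface, the following parser fully recognizes a hypothetical key-value input language (strings of the form key=value;) and rejects everything else. The grammar is illustrative and not taken from the cited sources.

```rust
use nom::{
    bytes::complete::tag,
    character::complete::{alpha1, alphanumeric1},
    combinator::all_consuming,
    sequence::{separated_pair, terminated},
    IResult,
};

/// Parsed form of the toy input language `key=value;`.
#[derive(Debug, PartialEq)]
struct Pair<'a> {
    key: &'a str,
    value: &'a str,
}

/// Grammar: key is one or more letters, value is one or more
/// alphanumerics, and the message is terminated by ';'.
fn pair(input: &str) -> IResult<&str, Pair<'_>> {
    let (rest, (key, value)) =
        terminated(separated_pair(alpha1, tag("="), alphanumeric1), tag(";"))(input)?;
    Ok((rest, Pair { key, value }))
}

fn main() {
    // all_consuming enforces full recognition: trailing bytes are an
    // error rather than silently ignored input.
    let mut parser = all_consuming(pair);
    assert_eq!(
        parser("mode=strict7;"),
        Ok(("", Pair { key: "mode", value: "strict7" }))
    );
    assert!(parser("mode=strict7;extra").is_err());
    assert!(parser("mode=;").is_err());
}
```

Because each combinator corresponds to a production of the grammar, the correspondence between the specification and the code can be checked clause by clause.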
Input Format Complexity

Complexity in computer programs is associated with security vulnerabilities.[11][12][13] Within the domain of language-theoretic security, complexity is described with reference to the computational power of the abstract machine necessary to implement the program, or more particularly, to implement the parser for its input language. This complexity determines whether it is possible to show that there is no unintended or undesired functionality in the program that might be exploitable by an attacker. For this complexity to be bounded, the program's input must be well-defined in terms of both form and semantics.[14]

Weird Machines

A weird machine is a model of computation in a program that exists in parallel with, but is distinct from, the intended abstract model of computation in that program. Some classes of weird machine arise from the multi-layered nature of computer programs, or from the context in which the programs run; others result from the unanticipated functionality a program has due to its complexity or to software bugs. The more complex the computation model of a program, the more likely it is to implement a weird machine. Depending on context, the weird machine may or may not be concretely useful to an attacker. Since the space of weird machines in the context of some program is the universe of all possible states that are not among the program's intended states, many exploited states, including remote code execution[15] and injection attacks, belong to the domain of weird machines. A reduction in weird machines is therefore a likely correlate of reduced program vulnerability.

SafeDocs Project

SafeDocs is a DARPA project undertaken in 2018 to take existing file formats, create safer subsets of them, and develop programming tools that work with the safer formats. The initial test case for this was PDF. The purpose of creating safer subsets in this case is to reduce the lower bound on parser complexity so that it becomes possible to create tools that will generate correct, normative parsers for them.[16]

Relation to Programming Languages

The analytic framework of language-theoretic security treats programs as virtual machines that execute their input. A document read by an application is in this sense a form of machine code, in a generalization of the data-as-code idea, following the automata-theoretic description of parsers.

Type-Safe Programming Languages

Parsing input and serializing output are operations that consume one data type and emit another. A programming language can therefore check that data is correctly parsed and contains the expected structure by checking data types, and correct serialization (or unparsing) can be implemented as operations on the data types that are relevant to the program's output. This approach can be used to show that the recognizer and unparser patterns have been implemented. It is also possible to implement type checking across a distributed system to enforce parsing and unparsing of the expected structures and to verify that the assumptions made in designing the compositional properties of a distributed system have been followed.[17]

Memory-Safe Programming Languages

In the general case, spatial memory correctness is undecidable. If any proof of spatial memory correctness is to be made, it is therefore necessary to bound the complexity of the code. Languages such as Java and Python effectively accomplish this via runtime bounds checking, and frameworks for runtime bounds checking also exist for C.[18] The effect of these strategies is to create a halt state in place of a spatial memory correctness violation; it can therefore be shown that the program will not violate spatial memory correctness, but in exchange, it cannot be shown in the general case that the program will not raise runtime bounds-checking exceptions. Some programming languages, such as Rust, accomplish this using borrow checking. The borrow checker assures spatial memory correctness by enforcing ownership and reference rules at compile time. Code whose spatial memory correctness cannot be shown therefore does not compile, inherently limiting the program's memory behaviour to what is decidable. This works to the extent that the borrow checker is correct, which has been shown for subsets of Rust but not for the totality of Rust; therefore, while the language is nominally "memory-safe", bugs that violate the temporal correctness of the memory model have been discovered.[19]

Program Analysis

When the complexity of program input is limited such that it requires only a sub-Turing program to validate, the parser is more amenable to verification and static analysis. The programming language Crema, as an example, experimentally demonstrated that an SMTP protocol parser written in even a minimally sub-Turing programming language allowed the symbolic execution engine KLEE to achieve more coverage while exploring fewer program states than the parser used in Qmail.[20]

Programming Patterns

Recognizer Pattern

The recognizer pattern is a software design pattern described by the language-theoretic security community whereby a program's input is fully recognized by a parser before the program acts upon that input.[21] As a result, the post-parsing program logic can be expressed in terms of only valid program input, and the number and complexity of cases that must be handled when checking the program for correctness can therefore be bounded and reduced. Bounding the complexity of the recognizer's output can likewise be used as a strategy to render the program logic amenable to verification.
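The following is an illustrative sketch of the recognizer pattern, with hypothetical names and a toy "GET token" input language: the ValidMessage type can only be constructed by the recognizer, so the program logic is unreachable for unrecognized input.

```rust
// Illustrative sketch of the recognizer pattern (names are hypothetical).
// Raw input is fully recognized into a dedicated type before any program
// logic runs; downstream code can only ever see valid input.

/// A validated message: constructible only via `recognize`.
struct ValidMessage {
    command: String,
}

/// Full recognition up front: accept only `GET <token>` where the token
/// is non-empty ASCII-alphanumeric, and reject everything else.
fn recognize(raw: &str) -> Result<ValidMessage, &'static str> {
    let mut parts = raw.split(' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("GET"), Some(token), None)
            if !token.is_empty() && token.bytes().all(|b| b.is_ascii_alphanumeric()) =>
        {
            Ok(ValidMessage { command: token.to_string() })
        }
        _ => Err("input rejected: not in the accepted language"),
    }
}

/// Program logic is written only in terms of the validated type.
fn handle(msg: &ValidMessage) {
    println!("handling {}", msg.command);
}

fn main() {
    match recognize("GET status42") {
        Ok(msg) => handle(&msg),
        Err(e) => eprintln!("{e}"),
    }
    // Path traversal characters are outside the accepted language,
    // so this input never reaches `handle`.
    assert!(recognize("GET ../etc/passwd").is_err());
}
```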
Unparser Pattern

The unparser pattern is the inverse of the recognizer and is designed to constrain the program's output so that it cannot be used to exploit consumers of its data. It is a more rigorous approach than output encoding: formal grammars are used to construct an output serializer that can only generate valid messages. Consumers of the program's output may therefore rely on the stated definition of the output data regardless of the program's input, making it possible to reason about the compositional properties of the whole system.
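As an illustrative counterpart to the recognizer sketch above, again with hypothetical names, the following unparser can emit only messages of the toy key=value; grammar, because its field type rejects, at construction time, any characters with structural meaning in that grammar.

```rust
// Illustrative sketch of the unparser pattern (names are hypothetical).
// Output is serialized only from a constrained type, so every emitted
// message is valid by construction, regardless of the program's input.

/// A field guaranteed, by construction, to contain no characters with
/// structural meaning in the output format.
struct SafeField(String);

impl SafeField {
    fn new(s: &str) -> Result<SafeField, &'static str> {
        if !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric()) {
            Ok(SafeField(s.to_string()))
        } else {
            Err("field rejected: would not serialize to a valid message")
        }
    }
}

/// The only way to produce output; consumers can rely on its shape.
fn unparse(key: &SafeField, value: &SafeField) -> String {
    format!("{}={};", key.0, value.0)
}

fn main() {
    let key = SafeField::new("mode").unwrap();
    let value = SafeField::new("strict7").unwrap();
    // Output is always `key=value;`, so a downstream parser of that
    // grammar cannot receive injected structure such as an extra ';'.
    assert_eq!(unparse(&key, &value), "mode=strict7;");
    assert!(SafeField::new("strict;DROP").is_err());
}
```

Because validity is enforced by the field type, every call site of the serializer is covered at once, rather than each output path needing its own encoding step.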
References