Text nailing

Text Nailing (TN) is an information extraction method used in the fields of natural language processing (NLP), computational linguistics, and health informatics. It was developed to address challenges in extracting structured, analyzable data from large volumes of unstructured text such as clinical notes, research articles, and other narrative documents.^[1]^[2]

Unlike fully automated machine learning systems, Text Nailing emphasizes a "human-in-the-loop" approach: a person interactively reviews small portions of text to identify expressions that are common, non-negated, and semantically informative. These "nailed expressions" are then converted into simplified alphabetical-only forms, creating consistent and homogeneous representations of the text. This hybrid strategy allows TN to combine the precision of expert-guided annotation with the scalability of computational techniques.^[3]

The method has been applied to improve predictive modeling in medicine, enhance the performance of text classifiers, and reduce reliance on large manually labeled datasets.^[4]^[5]

History and development

Text Nailing was first developed in the mid-2010s at Massachusetts General Hospital as part of efforts to improve information extraction from electronic health records (EHRs).^[6] The method was initially tested in several clinical scenarios, including the extraction of smoking status from narrative notes, identification of patients with physician-documented insomnia, and detection of family history of coronary artery disease.^[7]

Further applications demonstrated its potential in refining widely used clinical risk models. For example, Text Nailing was employed to improve the accuracy of the Framingham risk score in patients with non-alcoholic fatty liver disease,^[4] and to classify patient non-adherence in type-2 diabetes care. Its emphasis on identifying non-negated, recurrent expressions yielded higher accuracy compared with traditional machine learning approaches that required large, manually labeled datasets.^[5]

The method also sparked discussion about the broader role of human-in-the-loop approaches in health informatics. Commentators noted that machine learning in medicine often relied on assumptions about infinite linguistic variation, while clinical text in practice tends to reuse a limited set of expressions.^[8] A subsequent letter in Communications of the ACM emphasized that using non-negated expressions could increase the accuracy of text-based classifiers.^[9]

In July 2018, researchers at Virginia Tech and the University of Illinois at Urbana–Champaign cited Text Nailing as an example of "progressive cyber-human intelligence" (PCHI), recognizing its hybrid model of combining expert human annotation with computational scalability.^[10]

At the same time, critiques of machine learning in health care highlighted the risks of inflated expectations. Chen and Asch (2017) argued that more thoughtful approaches were needed to avoid disillusionment with predictive modeling in medicine.^[11] In this context, Text Nailing was described by its co-creator, Uri Kartoun, as a method that initially faced skepticism for relying on "simple tricks" and human annotation, but ultimately gained acceptance as a more robust approach to clinical text analysis.^[12]

Source code

A sample code for extracting smoking status from narrative notes using "nailed expressions" is available in GitHub.^[13]

Criticism of machine learning in health care

Chen & Asch 2017 wrote "With machine learning situated at the peak of inflated expectations, we can soften a subsequent crash into a "trough of disillusionment" by fostering a stronger appreciation of the technology's capabilities and limitations."^[11]

A letter published in Communications of the ACM, "Beyond brute force", emphasized that a brute force approach may perform better than traditional machine learning algorithms when applied to text. The letter stated "... machine learning algorithms, when applied to text, rely on the assumption that any language includes an infinite number of possible expressions. In contrast, across a variety of medical conditions, we observed that clinicians tend to use the same expressions to describe patients' conditions."^[14]

In his viewpoint published in June 2018 concerning slow adoption of data-driven findings in medicine, Uri Kartoun, co-creator of Text Nailing states that " ...Text Nailing raised skepticism in reviewers of medical informatics journals who claimed that it relies on simple tricks to simplify the text, and leans heavily on human annotation. TN indeed may seem just like a trick of the light at first glance, but it is actually a fairly sophisticated method that finally caught the attention of more adventurous reviewers and editors who ultimately accepted it for publication."^[15]