18 Text data foundations
Learning Objectives
- Explain principles of language models including tokens and embeddings
- Explain the logic of language based machine learning models
- Describe language model uses
- Use foundational text tools
Chapter content
The purpose of this chapter is to introduce concepts and tools for working with business text. Information is often stored in forms that are not accessible in a structured format that can easily be fed into traditional analyses. For example, information could be contained in text documents, images, or audio files. This chapter and the next focus on text data. Textual content can be a small string in an otherwise structured data set, text that contains numeric or qualitative data, a longer string that can be converted to or included in a structured data set, or the primary data source itself.
Text in accounting: text sources, types, and information content
Businesses produce, consume, and store vast amounts of text data. This text data can be used to understand customer sentiment, to predict future performance, to understand the competitive landscape, and to make better decisions. Sometimes text data is a small part of a larger data set, and sometimes it is the primary data source. Text data can be structured, semi-structured, or unstructured.
Text is also an important source and product of accounting information. For example, financial reports contain more textual information than numerical information, in the form of notes, supplementary disclosures, and explanations. Financial reporting also relies on contracts, leases, bonds, workflows, and other documents that lead to accounting numbers and require financial disclosures.
Various parties use the text created or consumed by accountants, accounting procedures, and accounting systems, and they may use textual tools to make working with that text more efficient. Accountants may use textual tools to extract key pieces of information from contracts and legal documents, to summarize documents, or to automate reporting. Regulators may use textual tools to search for changes in documents, to verify content, and to identify risks. Investors may use textual tools to infer intent, capture sentiment, or predict outcomes.
Foundations of textual analysis
This section describes foundational concepts and tools that may be useful on their own or as part of a broader understanding of generative and agentic AI models.
Computer models of text require transforming text components into numerical representations that a computer can interpret and use in mathematical tools. Working with text has evolved from simple methods to complex machine learning models with billions of parameters (large language models, or LLMs).
Tokens
The foundational starting point for working with text is called a token. A token is a character or combination of characters that can be treated as a unit. Tokens can be specific words, for example, “writing”. How tokens are defined involves tradeoffs in computational power, generalizability, and accuracy. The example word “writing” helps explain some of these tradeoffs. “writing” can take different forms depending on capitalization: “writing”, “Writing”, “WRITING”. It also shares meaning with other forms of the word: “write”, “written”. Other words can have similar meaning: “transcribing”, “speaking”, “noting”, “annotating”. Finally, qualifications can change the meaning of the word: “not writing”.
Consider some of the tradeoffs among possible definitions of tokens for the word “writing” and the relatively simple task of counting how many times “writing” appears in a sample text. The first definition of a token could be a word. In that case “writing” would be one token, “write” would be a separate token, and “Writing” would be a different token again.
You might consider different definitions for tokens. When you count the words, do you include “Writing”? Do you include “write”? Do you include “annotating”? Do you include “not writing”? A low computational cost approach would be to count only exact instances of “writing”. As the possibilities expand, the computational costs increase. Some heuristics have been used in the past to simplify what is meant by a token; some of these are defined below.
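As a small illustration (a minimal sketch, not part of the chapter’s later examples), the count of “writing” in a handful of tokens changes with the definition used:
tokens <- c("writing", "Writing", "write", "written", "annotating")
sum(tokens == "writing")                        # exact match only: 1
sum(tolower(tokens) == "writing")               # ignoring capitalization: 2
sum(grepl("writ", tokens, ignore.case = TRUE))  # any form that shares the stem "writ": 4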
Tools for working with text start with the ability to split and match characters and combinations of characters. The foundational tool is called regular expressions.
Regular expressions
Regular expressions (regex) are patterns used to match sequences of characters within text. They work by defining a set of rules—using a combination of ordinary characters (like letters and numbers) and special symbols (called metacharacters)—that a regular expression engine uses to search, match, or manipulate text.
A regular expression engine works by scanning the input text for matches to the pattern. When the engine finds a sequence that fits the pattern, it reports a match. The match can then be used to search, extract, replace, or split at the match point.
Examples of a regular expression pattern and the related match are provided below.
- Pattern: \d{3}-\d{2}-\d{4} matches a social security number format like “123-45-6789”.
- Pattern: ^[A-Z][a-z]+ matches a capitalized word at the start of a line.
Elements of regular expressions are described in the dropdown list below.
Even in a period of relatively low-cost and highly accessible language models, regular expressions remain useful for rule-based tasks and for tasks that clean and prepare data. The end of this chapter demonstrates regular expression usage in R.
From tokens to LLMs
With tokens defined, text processing developed from simple word counts and dictionaries to what is typically referred to as natural language processing.
Word counts are a foundational concept in language processing and Natural Language Processing (NLP). At their simplest, word counts refer to tallying how many times each word appears in a given text or corpus. This process is often called unigram word count or word frequency.
To aggregate word counts to the document level, they might be combined as a bag of words. In a bag-of-words aggregation, each document is represented as a vector of word counts, disregarding grammar and word order. The bag of words might be extended by weighting words by their frequency within a document or their inverse frequency across documents (as in TF-IDF weighting).
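A minimal bag-of-words sketch in base R, using two made-up one-sentence documents, might look like the following:
docs <- c("revenue increased and costs decreased",
          "revenue decreased and risks increased")
tokens <- strsplit(tolower(docs), "\\s+")            # split each document into word tokens
vocab <- sort(unique(unlist(tokens)))                # shared vocabulary across the documents
bow <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))
bow                                                  # each row is one document's vector of word counts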
Dictionaries are structured resources that list words, their meanings, grammatical information, and sometimes usage examples. In language processing, dictionaries and related lexical resources (like glossaries and encyclopedias) apply meaning to lists of words. For example, a dictionary of positive words could be used to count how positive a document is. The net sentiment of a document might use the number of positive words minus the number of negative words from the relevant dictionaries.
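A net sentiment count might be sketched as follows; the word lists here are made up for illustration and are far shorter than real sentiment dictionaries:
positive <- c("growth", "improved", "strong")
negative <- c("loss", "decline", "weak")
doc <- "strong growth offset a small decline in margins"
words <- strsplit(tolower(doc), "\\s+")[[1]]          # split the document into words
sum(words %in% positive) - sum(words %in% negative)   # 2 positive words - 1 negative word = 1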
Word counts and dictionaries are computationally simple. The problem with word counts and dictionaries is that they do not capture the context or more subtle meanings of text. Other methods were then developed that allow computers to capture more meaning from text. However, the most complex methods require extraordinary amounts of data and computing power.
The next step in the progress of natural language processing is to capture the meaning of tokens in the context of surrounding tokens. The simplest form of capturing context is to use longer combinations of tokens by including nearby tokens as n-grams or skip-grams. An n-gram is a group of n consecutive tokens; for example, a bi-gram treats each two-token combination as a separate token. Skip-grams include nearby, but not contiguous, tokens. The challenge with expanded word combinations is that the number of tokens used to represent a text increases dramatically relative to a simple single-token bag of words.
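A minimal sketch of building bi-grams from a short token vector:
tokens <- c("net", "income", "increased", "this", "year")
bigrams <- paste(tokens[-length(tokens)], tokens[-1])   # pair each token with the one that follows it
bigrams   # "net income" "income increased" "increased this" "this year"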
The final step to capturing meaning is capturing similarities and differences between words. Word embeddings represent words as vectors of numbers. This approach allows computers to process and understand text by capturing the semantic and syntactic relationships between words in a way that traditional methods cannot. For an embedding, each word is mapped to a numeric vector, typically with tens or hundreds of dimensions. The position of each word in this vector space is determined so that words with similar meanings or that appear in similar contexts are located close to one another. The distance and direction between vectors encode the degree of similarity between words. For example, “king” and “queen” will have vectors that are close together, while “king” and “banana” will be far apart. Word embeddings are trained on large text corpora. The training process adjusts the vectors based on the contexts in which words appear, enabling the vectors to capture nuanced linguistic patterns and relationships.
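A minimal sketch of the idea, using made-up three-dimensional vectors (real embeddings have tens or hundreds of dimensions learned from data):
king <- c(0.8, 0.6, 0.1)
queen <- c(0.7, 0.7, 0.2)
banana <- c(0.1, 0.0, 0.9)
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(king, queen)    # close to 1: the vectors point in similar directions
cosine(king, banana)   # much smaller: the vectors point in different directions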
Word embeddings and machine learning models (recurrent neural networks and, more recently, transformers) have been combined with extreme amounts of data and computational power to create large language models. These models are designed to take input tokens and predict the next token. The models are typically used iteratively: the predicted token is appended to the input tokens to form the next set of input tokens, and the following token is predicted. This process is repeated so that it appears that complete sentences, paragraphs, and documents are being created even though each step produces a single token. Because these models have been trained on billions of documents and contain billions of parameters, they can generate a wide variety of text in seemingly intelligent ways.
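The iterative, one-token-at-a-time process can be sketched conceptually as a loop. In the sketch below, predict_next_token is a hypothetical stand-in for a trained model, not a real function:
predict_next_token <- function(tokens) {
  canned <- c("Revenue", "increased", "this", "quarter", ".")
  canned[min(length(tokens) + 1, length(canned))]   # pretend prediction, for illustration only
}
tokens <- c()                                    # start from an empty prompt
for (i in 1:5) {
  next_token <- predict_next_token(tokens)       # the model predicts one token
  tokens <- c(tokens, next_token)                # the prediction joins the input for the next step
}
paste(tokens, collapse = " ")                    # "Revenue increased this quarter ."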
Understanding the details of natural language processing and language models is beyond the purpose of this textbook. This chapter will end with the application of the simplest tool – regular expressions. The next chapter will explore some applications of large language models.
Examples in R
To explore regular expressions, this section uses the stringr package that is part of the tidyverse set of packages (there are other options in R and in other languages). A noticeable difference from some other implementations of regular expressions is how expressions that use “\” are written. If you have used regular expressions before, the main difference is that in R you need to write “\\” instead of “\”, because “\” is an escape character in R strings. Many cheatsheets are available for referencing regular expressions. The stringr package cheatsheet is available here: https://rstudio.github.io/cheatsheets/strings.pdf. The regular expression list is on the second page of the pdf, while the stringr functions are on the first page.
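A quick way to see the escaping, added here as a small aside, is to print the string R actually passes to the regular expression engine:
writeLines("\\d")   # prints \d, the pattern the regular expression engine sees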
The demonstration of regular expressions with stringr uses a 10-K filing from Target Corporation. This is a public document that is filed with the Securities and Exchange Commission (SEC). This 10-K is available on the SEC EDGAR website, on Target’s website, and here: https://www.dropbox.com/scl/fi/smz1btldpblv99m5bu1a8/Target2009.txt?rlkey=re0lfufxzlww3jzh5w4zi8pi9&dl=0.
First, load the packages and read in the text file as a text string.
library(tidyverse)
library(readr)
txtstr <- read_file("Target2009.txt")
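As a quick check that the file was read (an added step, not strictly necessary), you can look at the number of characters in the string:
str_length(txtstr)   # total number of characters read from the 10-K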
The stringr package has various functions, for example, str_detect, str_count, and str_extract. These functions share the same basic syntax: the first argument is the string to search, and the second argument is the pattern to search for. The pattern can be a simple string or a regular expression. The code below provides some examples.
The str_detect function will return TRUE or FALSE if a pattern is found in a string. The ignore_case argument is optional but allows for patterns to match upper and lower case letters. The dotall argument is optional but allows for the “.” to match newlines.
str_detect(txtstr, regex("Target", ignore_case = TRUE, dotall=TRUE))
The first argument, txtstr, is the string to search. In a data frame this could be a column; it could also be a list; in this case, it is the entire text document. The regex function is used to specify the pattern to search for, which can be a simple string or a regular expression. The code above can be altered to use patterns rather than only literals. The following example extracts text using special matching characters. str_extract returns the first instance of a pattern found in a string (str_extract_all returns all instances of a pattern found in a string).
str_extract(txtstr, regex("\\w+Target", ignore_case = FALSE, dotall=TRUE))
\w is a special matching character for any word character, i.e., letters, digits, and the underscore. For stringr, the special matching character has to be written as \\w. The + matches the preceding item one or more times. This example extracts the first instance of one or more word characters followed by the literal “Target”. Importantly, \\w+ does not match spaces or other non-word characters.
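To see what the pattern matches, here is a small made-up string (not the 10-K) in which a run of word characters is fused with the literal:
str_extract("Visit a SuperTarget store near you.", regex("\\w+Target"))   # returns "SuperTarget"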
str_count will return the number of times a pattern is found in a string.
str_count(txtstr, regex("\\w+Target", ignore_case = FALSE, dotall=TRUE))
str_count(txtstr, regex("Target", ignore_case = FALSE, dotall=TRUE))
str_count could be used for other purposes, for example, approximating the number of words by counting runs of word characters:
str_count(txtstr, regex("\\w+", ignore_case = TRUE, dotall=TRUE))
Regular expressions can be useful when text needs to be cleaned or edited. For example, many find-and-replace tools accept regular expressions. A replacement can be demonstrated in the Target example.
str_replace_all(txtstr, regex("Target", ignore_case = FALSE, dotall=TRUE), "Walmart")
The str_replace and str_replace_all functions add a third argument: the replacement for the matched string.
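Because replacing across the entire 10-K produces a very long result, a short made-up string makes the effect easier to see:
str_replace_all("Target reported that Target stores grew.", regex("Target"), "Walmart")
# "Walmart reported that Walmart stores grew."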
Tutorial video
Conclusion
Review
Mini-case video
References