# Chapter 20: LLM Basics

## Learning outcomes

At the end of this chapter, you should be able to:

- Explain principles of language models, including tokens and embeddings
- Explain the logic of language-based machine learning models
- Use prompting techniques for LLMs

## Chapter content

Note: include practice and knowledge checks throughout chapter
<!-- FOR READING IN WITH CHANGES:
library(tidyverse)
library(readr)
library(udpipe)
library(word2vec)
library(doc2vec)
txtstr <- read_file("C:/Users/jgreen/Dropbox/ACC648DocumentSharing/Target2009.txt")
notes <- data.frame(txt=readLines("C:/Users/jgreen/Dropbox/ACC648DocumentSharing/PatientDatesNotes.txt"))
txtstr2 <- str_replace_all(txtstr, "<.*?>", "")
txtstr3 <- str_replace_all(txtstr2, "&nbsp;|\\d|[:punct:]|\\n|\\t|\\r", "")
txtstr3 <- str_replace_all(txtstr3, "\\s\\s+", " ")
reviews <- read.csv("C:/Users/jgreen/Dropbox/ACC648DocumentSharing/AmazonReviews/train.csv", header=FALSE)
names(reviews) <- c("rating","title","review")
set.seed(1234)
reviews <- reviews %>%
  sample_frac(0.5)
reviews <- reviews %>%
  drop_na(review,rating,title) %>%
  filter(nchar(review)>=100)
reviews$review <- txt_clean_word2vec(reviews$review, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)
reviews$review <- str_extract(reviews$review, regex("(\\w+(\\W|$)){1,1000}"))
reviews$doc_id <- 1:nrow(reviews)
reviews <- reviews %>%
  rename(text=review)
mdl <- read.paragraph2vec("embedmdl.bin")
emb <- as.matrix(mdl)
dset <- data.frame(
  rating = reviews$rating,
  emb
)
rm(emb,reviews)
-->
# Textual tools and natural language processing
Information is often stored in forms that cannot easily be fed into traditional analyses as structured data. For example, information could be contained in text documents, images, or audio files. In this section, we will focus on text data. Textual content can be small strings in an otherwise structured data set, text that contains numeric or qualitative data, longer strings that can be converted to or included in a structured data set, or the primary data source itself.
## Text in business — text sources, types, and information content
Businesses produce, consume, and store vast amounts of text data. This text data can be used to understand customer sentiment, to predict future performance, to understand the competitive landscape, and to make better decisions. Sometimes text data is a small part of a larger data set, and sometimes it is the primary data source. Text data can be structured, semi-structured, or unstructured.
## Purpose
The purpose of this chapter is to introduce you to concepts and tools that are used with business text. We will spend the most time on tools that can be used to simplify repetitive tasks such as counting, cleaning, finding, and extracting parts of text. We will also introduce you to more advanced tools that can be used to analyze text data, such as natural language processing (NLP) and machine learning models. We will use examples from business texts to illustrate these tools.
### Semi-structured strings
Many structured data sets contain columns that are composed of or include character strings such as names, addresses, descriptions, or notes. These strings may contain valuable information that can be extracted and used in analysis. For example, a column that contains a description of a product may contain information about the product’s features, quality, or price. A column that contains a customer’s address may contain information about the customer’s location, income, or preferences. A column that contains a note about a customer service interaction may contain information about the customer’s satisfaction, loyalty, or likelihood to recommend the company to others.
Other times, columns may contain errors or inconsistencies that need to be cleaned or corrected before they can be used in analysis. For example, a column that contains a customer’s name may contain misspellings, variations, or missing information that need to be standardized or imputed. A column that contains a product description may contain abbreviations, acronyms, or jargon that need to be expanded or translated. A column that contains a note about a customer service interaction may contain irrelevant or sensitive information that needs to be redacted or anonymized. A column that contains a date or time may contain errors or inconsistencies that need to be validated or corrected. A column that contains a phone number may contain formatting or validation errors that need to be standardized or corrected.
For semi-structured text, regular expressions and string manipulation functions can be used to extract, clean, or correct information. Regular expressions are patterns that can be used to match, search, or replace text. String manipulation functions are functions that can be used to split, join, or transform text. Together, regular expressions and string manipulation functions can be used to extract, clean, or correct information in semi-structured text.
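The steps above can be sketched with stringr on a few made-up phone-number strings (the numbers and target format here are hypothetical): one regular expression strips every non-digit character, and a second pattern with capture groups rebuilds a consistent format.

```r
library(stringr)

# Hypothetical phone numbers stored with inconsistent formatting
phones <- c("(612) 304-6073", "612.304.6073", "612 304 6073")

# "\\D" matches any non-digit; replacing with "" leaves only the digits
digits <- str_replace_all(phones, "\\D", "")

# Capture groups (\\1, \\2, \\3) rebuild a single consistent format
formatted <- str_replace(digits, "(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3")
formatted
# [1] "612-304-6073" "612-304-6073" "612-304-6073"
```

The same strip-then-rebuild pattern applies to dates, identifiers, and other semi-structured fields.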
### Text containing numeric or qualitative data
Related to semi-structured strings, text data may contain numeric or qualitative data that needs to be extracted and converted to a structured format. For example, a product description may state the product's price, quantity, or availability, and a note about a customer service interaction may record the customer's satisfaction or likelihood to recommend the company to others.
For text containing numeric or qualitative data, regular expressions and string manipulation functions can be used to extract information.
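As a brief sketch (the product descriptions below are invented for illustration), str_extract can pull a dollar amount out of free text, and readr's parse_number can convert the match to a numeric value:

```r
library(stringr)
library(readr)

# Hypothetical free-text product descriptions with embedded prices
desc <- c("Widget, pack of 12, $19.99", "Gadget, single unit, $5.50")

# "\\$\\d+\\.\\d{2}" matches a dollar sign, digits, a period, and two decimals
price <- parse_number(str_extract(desc, "\\$\\d+\\.\\d{2}"))
price
# [1] 19.99  5.50
```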
### Long strings that can be converted to structured data
Sometimes text data is stored in long strings that can be converted to a structured format. Various dimensions of the text can summarize the long strings, and different tools have been developed to convert text data to structured data. This process is called natural language processing (NLP). In some of its simplest forms, NLP can be used to count the frequency of words or phrases in a text, to identify the sentiment of a text, or to score a text with dictionaries that reflect the uncertainty, tone, or other characteristics of the text.
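As a minimal illustration of the dictionary idea, the three-word "dictionary" and the sentence below are made up for this example; real analyses use established word lists for tone, uncertainty, and similar constructs:

```r
library(stringr)

# A toy dictionary of positive words (hypothetical; real lists are much longer)
positive <- c("growth", "improved", "strong")
review_text <- "Strong sales and improved margins drove growth this quarter."

# Build one alternation pattern; "\\b" word boundaries avoid partial-word hits
pattern <- regex(str_c("\\b(", str_c(positive, collapse = "|"), ")\\b"),
                 ignore_case = TRUE)
hits <- str_count(review_text, pattern)
hits
# [1] 3
```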
### Text as the primary data source
Sometimes text data is the primary data source. For example, a company may have a large number of text documents that contain information about its products, customers, competitors, or industry. In addition to the tools used in the previous sections, more advanced tools can be used to analyze text data. These tools may include machine learning models that use structured representations of text to predict topics, categories, sentiment, specific outcomes, or other words. These models may use supervised or unsupervised learning techniques to identify patterns in the text data. Generative AI is an example of unsupervised learning that can be used to generate new text based on the patterns in the text data.
Companies use these tools to build spam filters that predict which emails are spam, to recommend products based on customer comments and other input, to automatically create reports based on input, or to create customer or employee chat agents that respond to user inputs.
### Accounting related text
Text is also an important source and product of accounting information. For example, financial reports often contain more textual information than numeric information, in the form of notes, supplementary disclosures, and explanations. Financial reporting also draws on contracts, leases, bonds, workflows, and other documents that lead to accounting numbers and require financial disclosures.
Various parties may use the text created by or used by accountants and accounting procedures and systems and may use textual tools to make working with the text more efficient. Accountants may use textual tools to extract key pieces of information from contracts and legal documents, to summarize documents, or to automate reporting. Regulators may use textual tools to search for changes in documents, to verify content, and to identify risks. Investors may use textual tools to infer intent, capture sentiment, or predict outcomes.
## Data
We will use different sources of text for this chapter.
The first is a 10-K filing from Target Corporation. This is a public document that is filed with the Securities and Exchange Commission (SEC). This 10-K is available on the SEC EDGAR website, on Target's website, and [is available here](https://www.dropbox.com/scl/fi/smz1btldpblv99m5bu1a8/Target2009.txt?rlkey=re0lfufxzlww3jzh5w4zi8pi9&dl=0).
The second is a set of textual physician notes from Kaggle with dates that [is available here](https://www.dropbox.com/scl/fi/jumb2la9js8mir10gpevm/PatientDatesNotes.txt?rlkey=pgdzt4vou069ylr2os2x8j2wb&dl=0).
The third is a corpus of text from Amazon product reviews that [is from this paper](https://cs.stanford.edu/people/jure/pubs/reviews-recsys13.pdf) and [the training data is available here](https://www.dropbox.com/scl/fi/8k049zmovmqthwvx45k9w/train.csv?rlkey=vh28qq3a2iyg5an39fc5bdzyo&st=npbdru1u&dl=0).
## Regular expressions and stringr
All textual analysis works with strings that are stored as characters. How these characters are analyzed and stored leads to different tools. The foundation of working with text is regular expressions. Regular expressions represent individual characters, types of characters, and character patterns that allow for computerized processing of text. Regular expressions are used in many programming languages and text editors to search for and manipulate text. Regular expressions across languages can differ slightly, but the principles are the same and the basic patterns are similar.
We will use the stringr package, part of the tidyverse set of packages, to work with regular expressions (there are other options in R and in other languages). A noticeable difference between stringr and some other implementations of regular expressions is how some expressions are written with "\". If you have used regular expressions before, the main difference is that in stringr you need to use "\\" instead of "\", because "\" is an escape character in R. Many cheatsheets are available for referencing regular expressions. The stringr package cheatsheet [is available here](https://rstudio.github.io/cheatsheets/strings.pdf); the regular expression list is on the second page of the pdf, while the stringr functions are on the first page.
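A quick way to see why the double backslash is needed: in R source code, "\\d" produces a string containing the two characters \d, which is what the regex engine actually receives.

```r
library(stringr)

# writeLines prints the string as the regex engine sees it
writeLines("\\d")
# \d

# The regex engine reads \d as "any digit"
yr <- str_extract("Fiscal year 2009", "\\d+")
yr
# [1] "2009"
```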
Here we will first introduce the basic stringr syntax and how to use regular expressions with stringr. We will then introduce some basic regular expressions that can be used to count, clean, find, and extract parts of text.
We can first load the stringr package as part of the tidyverse and get a document to work with. We will start with the Target 10-K document. We also need a function to read in the text file, so we will use the readr package.
```r
library(tidyverse)
library(readr)
txtstr <- read_file("Target2009.txt")
```
From stringr, let’s work with three functions: str_detect, str_count, and str_extract. str_detect will return TRUE or FALSE if a pattern is found in a string. str_count will return the number of times a pattern is found in a string. str_extract will return the first instance of a pattern found in a string (str_extract_all will return all instances of a pattern found in a string).
All of the functions have the same syntax. The first argument is the string to search. The second argument is the pattern to search for, which can be a fixed string or a regular expression. We will set up a syntax to use for all of the examples.
```r
str_detect(txtstr, regex("regular expression pattern goes here", ignore_case = TRUE, dotall=TRUE))
```
The ignore_case argument is optional but allows for patterns to match upper and lower case letters. The dotall argument is optional but allows the "." to match newlines.
The first argument, txtstr, is the string to search. In a data frame this could be a column; it could also be a list. In this case, it is the entire text document. The regex function is used to specify the pattern to search for. The pattern can be a simple string or a regular expression.
Running the code above will return FALSE because nowhere in the document is "regular expression pattern goes here" found. We can alter the pattern to match something that is in the document.
```r
str_detect(txtstr, regex("Target", ignore_case = TRUE, dotall=TRUE))
[1] TRUE
```
Let's try using some regular expressions that can match patterns rather than specific strings. Suppose we are trying to find any brand mention of Target with characters before "Target" (e.g., "SuperTarget"), but we don't know what the preceding words are, or there might be different words. To see what we are capturing, we will use str_extract to grab what we see. str_extract will only get the first match that it finds.
Here we will try some alternatives. We will use ".", which matches any character; "[a-z]", which matches any lowercase letter; and "\\w", which matches any word character. Note that we are setting the ignore_case argument to FALSE (the default) because we are looking for matches with specific cases.
```r
str_extract(txtstr, regex(".Target", ignore_case = FALSE, dotall=TRUE))
[1] " Target"
str_extract(txtstr, regex("[a-z]Target", ignore_case = FALSE, dotall=TRUE))
[1] "rTarget"
str_extract(txtstr, regex("\\wTarget", ignore_case = FALSE, dotall=TRUE))
[1] "rTarget"
```
Before discussing the results, it is helpful to understand how the regular expression evaluates a pattern. A regular expression evaluates a pattern from left to right, one character at a time. It also walks through the text from left to right, one character at a time. With "|" representing the cursor (don't confuse this with the "or" operator from regular expressions), let's describe how the regular expression works. We will work through the first version above.
First, the regular expression is matching ".Target". The "." matches any character, so the regular expression is looking for any character followed by "Target". The part of the text that the first match comes from is "Portions of Target's Proxy Statement". The algorithm goes through the text character by character, starting at the first character:
* "|.Target" -- "|Portions of Target's Proxy Statement".
The cursor is at the beginning of the "P" character. The first part of the pattern is ".", which matches anything. "P" matches, so the pattern succeeds up to this point, and the cursor advances in both the pattern and the text. Now we have:
* ".|Target" -- "P|ortions of Target's Proxy Statement".
The cursor has moved past the "." match and the "P" character. The algorithm now sees "T" in the pattern and looks for "T" at the cursor in the text. It finds "o" instead, so the match fails. The pattern resets and the search starts again from this point in the text.
* "|.Target" -- "P|ortions of Target's Proxy Statement".
The cursor is now at the beginning of the "o" character. The same process repeats: the "." matches, but the "T" does not. This continues until the space before "Target" is reached.
* "|.Target" -- "Portions of| Target's Proxy Statement".
Now the cursor starts the pattern again. The "." matches anything, and here it matches the space. When the pattern moves to the next position, it finds the "T".
* ".|Target" -- "Portions of |Target's Proxy Statement".
Because the pattern has matched up to this point, it moves ahead.
* ".T|arget" -- "Portions of T|arget's Proxy Statement".
The steps repeat until the pattern is fully matched.
The second pattern, "[a-z]", matches any lowercase letter, so it does not match the first instance of "Target" that the first pattern found. It also passes by other instances of "Target", for example "&nbsp;Target Corporation", because the character immediately before the "T" is the ";" that ends the "&nbsp;" entity. The first instance it finds is "rTarget". In the third pattern, "\\w" matches any word character, upper or lower case. The first instance it finds is also "rTarget".
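The differences between the three patterns are easier to see on a short toy string (made up for illustration) where the characters before "Target" differ:

```r
library(stringr)

s <- ";Target rTarget"

str_extract(s, ".Target")      # "." matches anything, including ";"
# [1] ";Target"
str_extract(s, "[a-z]Target")  # only a lowercase letter, so ";" is skipped
# [1] "rTarget"
str_extract(s, "\\wTarget")    # word characters only; ";" is not one
# [1] "rTarget"
```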
Let's expand the pattern backwards to see if we can get "SuperTarget".
```r
str_extract(txtstr, regex("[a-z]+Target", ignore_case = FALSE, dotall=TRUE))
[1] "uperTarget"
str_extract(txtstr, regex("\\w+Target", ignore_case = FALSE, dotall=TRUE))
[1] "SuperTarget"
```
Notice that in the extract statements, the first pattern finds "uperTarget" while the second gets "SuperTarget". This is because the first pattern is looking for lowercase letters before "Target", while the second pattern is looking for any word characters, including the uppercase "S".
The "+" matches one or more of the preceding element. Let's see how the first part of the pattern works here.
* "|[a-z]+Target" -- "S|uperTarget".
Here, the pattern is looking for one or more lowercase letters. The match cannot begin at the uppercase "S", so the first position where it can succeed is the "u". At the cursor, it finds the lowercase "u". The cursor moves ahead and finds another lowercase letter.
* "[a-z]+|Target" -- "Su|perTarget".
The "[a-z]+" continues to match lowercase letters. When it reaches the uppercase "T", "[a-z]+" can no longer consume characters, so the engine moves on to the next part of the pattern, which then matches "Target".
If you look in the 10-K document, you see that the text file contains HTML markup in it. We can use regular expressions to remove a lot of the HTML tag information. We can use the "<.*?>" pattern to match any HTML tag. The "*" matches zero or more of the preceding element. The "?" makes the "*" non-greedy, which means that it will match the smallest possible string that satisfies the pattern.
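A small made-up HTML fragment shows what the "?" changes: the greedy version runs to the last ">" it can find, while the non-greedy version stops at the first one.

```r
library(stringr)

html <- "<b>Sales</b> rose"

# Greedy: ".*" consumes as much as possible before the final ">"
greedy <- str_extract(html, "<.*>")
greedy
# [1] "<b>Sales</b>"

# Non-greedy: ".*?" stops at the first ">" that completes the match
nongreedy <- str_extract(html, "<.*?>")
nongreedy
# [1] "<b>"
```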
First, let's extract a little of the text to see what it looks like without printing everything. Here, instead of the "*" or the "+", we will specify the number of characters to match.
```r
str_extract(txtstr, regex(".{5000}", ignore_case = TRUE, dotall=TRUE))
[1] "-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type: 2001,MIC-CLEAR\nOriginator-Name: webmaster@www.sec.gov\nOriginator-Key-Asymmetric:\n MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen\n TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB\nMIC-Info: RSA-MD5,RSA,\n L7CdK5nwZJdHN9N4C6RDDayMih9LBecZFr1sxAzyr/WHSgckw8MNwJOxASeAgAxr\n KOsz1B9/X1pW4PwAthLvtA==\n\n<SEC-DOCUMENT>0001047469-09-002623.txt : 20090313\n<SEC-HEADER>0001047469-09-002623.hdr.sgml : 20090313\n<ACCEPTANCE-DATETIME>20090313121617\nACCESSION NUMBER:\t\t0001047469-09-002623\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t25\nCONFORMED PERIOD OF REPORT:\t20090131\nFILED AS OF DATE:\t\t20090313\nDATE AS OF CHANGE:\t\t20090313\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tTARGET CORP\n\t\tCENTRAL INDEX KEY:\t\t\t0000027419\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tRETAIL-VARIETY STORES [5331]\n\t\tIRS NUMBER:\t\t\t\t410215170\n\t\tSTATE OF INCORPORATION:\t\t\tMN\n\t\tFISCAL YEAR END:\t\t\t0131\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-06049\n\t\tFILM NUMBER:\t\t09678638\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t1000 NICOLLET MALL\n\t\tCITY:\t\t\tMINNEAPOLIS\n\t\tSTATE:\t\t\tMN\n\t\tZIP:\t\t\t55403\n\t\tBUSINESS PHONE:\t\t6123046073\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\t1000 NICOLLET MALL\n\t\tCITY:\t\t\tMINNEAPOLIS\n\t\tSTATE:\t\t\tMN\n\t\tZIP:\t\t\t55403\n\n\tFORMER COMPANY:\t\n\t\tFORMER CONFORMED NAME:\tDAYTON HUDSON CORP\n\t\tDATE OF NAME CHANGE:\t19920703\n\n\tFORMER COMPANY:\t\n\t\tFORMER CONFORMED NAME:\tDAYTON CORP\n\t\tDATE OF NAME CHANGE:\t19690728\n</SEC-HEADER>\n<DOCUMENT>\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>a2190597z10-k.htm\n<DESCRIPTION>FORM 10-K\n<TEXT>\n<HTML>\n<HEAD>\n</HEAD>\n<BODY BGCOLOR=\"#FFFFFF\" LINK=BLUE VLINK=PURPLE>\n<BR>\n\n<P style=\"font-family:arial;text-align:justify\"><FONT SIZE=2>\n\n\n<!-- COMMAND=ADD_BASECOLOR,\"Black\" -->\n\n\n\n\n<!-- 
COMMAND=ADD_DEFAULTFONT,\"font-family:arial;\" -->\n\n\n\n\n<!-- COMMAND=ADD_TABLESHADECOLOR,\"#CCEEFF\" -->\n\n\n\n\n<!-- COMMAND=ADD_STABLERULES,\"border-bottom:solid #000000 1.0pt;\" -->\n\n\n\n\n<!-- COMMAND=ADD_DTABLERULES,\"border-bottom:solid #000000 2.25pt;\" -->\n\n\n\n\n\n<!-- COMMAND=ADD_SCRTABLERULES,\"border-bottom:solid #000000 1.0pt;margin-bottom:0pt;\" -->\n\n\n\n\n<!-- COMMAND=ADD_DCRTABLERULES,\"border-bottom:solid #000000 2.25pt;margin-bottom:0pt;\" -->\n\n\n<!-- PARA=JUSTIFY -->\n</FONT></P>\n\n<P style=\"font-family:arial;text-align:justify\"><FONT SIZE=2>\n<A HREF=\"#bG11001A_main_toc\">Table of Contents</A> </FONT></P>\n\n<P style=\"font-family:arial;text-align:justify\"><FONT SIZE=2><I> <div style=\"width:100%;border-top:solid #000000 3.0pt;padding:0in 0in 0in 0in;font-size:3.0pt;\"></div>\n<div style=\"width:100%;border-top:solid #000000 1.0pt;padding:0in 0in 0in 0in;font-size:4.0pt;\"></div> </I></FONT></P>\n\n<P ALIGN=\"CENTER\" style=\"font-family:arial;\"><FONT SIZE=4><B>UNITED STATES<BR>\nSECURITIES AND EXCHANGE COMMISSION<BR> </B></FONT><FONT SIZE=2>Washington, D.C. 
20549 </FONT></P>\n\n<P ALIGN=\"CENTER\" style=\"font-family:arial;\"><FONT SIZE=2><I>\n\n<!-- COMMAND=ADD_LINERULETXT,NOSHADE COLOR=\"#000000\" SIZE=\"1.0PT\" WIDTH=\"25%\" ALIGN=\"CENTER\" -->\n<HR NOSHADE COLOR=\"#000000\" SIZE=\"1.0PT\" WIDTH=\"25%\" ALIGN=\"CENTER\" >\n\n\n </I></FONT><FONT SIZE=2>\n\n<!-- COMMAND=ADDING_LINEBREAK -->\n\n<BR></FONT></P>\n\n<P ALIGN=\"CENTER\" style=\"font-family:arial;\"><FONT SIZE=4><B>FORM 10-K </B></FONT></P>\n\n<!-- COMMAND=ADD_TABLEWIDTH,\"100%\" -->\n\n<!-- User-specified TAGGED TABLE -->\n<DIV ALIGN=\"CENTER\"><TABLE width=\"100%\" BORDER=0 CELLSPACING=0 CELLPADDING=0>\n<TR><!-- TABLE COLUMN WIDTHS SET -->\n<TD WIDTH=\"48\" style=\"font-family:arial;\"></TD>\n<TD WIDTH=\"12\" style=\"font-family:arial;\"></TD>\n<TD WIDTH=\"434\" style=\"font-family:arial;\"></TD>\n<!-- TABLE COLUMN WIDTHS END --></TR>\n\n<TR VALIGN=\"BOTTOM\">\n<TD ALIGN=\"CENTER\" VALIGN=\"TOP\" style=\"font-family:arial;\"><FONT SIZE=2><B>(Mark One)</B></FONT></TD>\n<TD VALIGN=\"TOP\" style=\"font-family:arial;\"><FONT SIZE=2> </FONT></TD>\n<TD VALIGN=\"TOP\" style=\"font-family:arial;\"><FONT SIZE=2> </FONT></TD>\n</TR>\n<TR VALIGN=\"TOP\">\n<TD ALIGN=\"CENTER\" style=\"font-family:arial;\"><BR><FONT SIZE=2><FONT FACE=\"WINGDINGS\">ý</FONT></FONT></TD>\n<TD style=\"font-family:arial;\"><FONT SIZE=2><BR> </FONT></TD>\n<TD style=\"font-family:arial;\"><BR><FONT SIZE=2><B> ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</B></FONT></TD>\n</TR>\n<TR VALIGN=\"TOP\">\n<TD COLSPAN=3 ALIGN=\"CENTER\" style=\"font-family:arial;\"><BR><FONT SIZE=2> For the fiscal year ended January 31, 2009</FONT></TD>\n</TR>\n<TR VALIGN=\"TOP\">\n<TD COLSPAN=3 ALIGN=\"CENTER\" style=\"font-family:arial;\"><BR><FONT SIZE=2><B> OR</B></FONT></TD>\n</TR>\n<TR VALIGN=\"TOP\">\n<TD ALIGN=\"CENTER\" style=\"font-family:arial;\"><BR><FONT SIZE=2><FONT FACE=\"WINGDINGS\">o</FONT></FONT></TD>\n<TD style=\"font-family:arial;\"><FONT SIZE=2><BR> </FONT></TD>\n<TD 
style=\"font-family:arial;\"><BR><FONT SIZE=2><B> TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</B></FONT></TD>\n</TR>\n<TR VALIGN=\"BOTTOM\">\n<TD COLSPAN=3 ALIGN=\"CENTER\" VALIGN=\"TOP\" style=\"font-fa"
```
Here we are matching any character 5000 times. The output shows that much of the text is HTML tags. Let's remove some tags and then do the same thing.
```r
txtstr2 <- str_replace_all(txtstr, "<.*?>", "")
str_extract(txtstr2, regex(".{5000}", ignore_case = TRUE, dotall=TRUE))
[1] "-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type: 2001,MIC-CLEAR\nOriginator-Name: webmaster@www.sec.gov\nOriginator-Key-Asymmetric:\n MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen\n TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB\nMIC-Info: RSA-MD5,RSA,\n L7CdK5nwZJdHN9N4C6RDDayMih9LBecZFr1sxAzyr/WHSgckw8MNwJOxASeAgAxr\n KOsz1B9/X1pW4PwAthLvtA==\n\n0001047469-09-002623.txt : 20090313\n0001047469-09-002623.hdr.sgml : 20090313\n20090313121617\nACCESSION NUMBER:\t\t0001047469-09-002623\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t25\nCONFORMED PERIOD OF REPORT:\t20090131\nFILED AS OF DATE:\t\t20090313\nDATE AS OF CHANGE:\t\t20090313\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tTARGET CORP\n\t\tCENTRAL INDEX KEY:\t\t\t0000027419\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tRETAIL-VARIETY STORES [5331]\n\t\tIRS NUMBER:\t\t\t\t410215170\n\t\tSTATE OF INCORPORATION:\t\t\tMN\n\t\tFISCAL YEAR END:\t\t\t0131\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-06049\n\t\tFILM NUMBER:\t\t09678638\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t1000 NICOLLET MALL\n\t\tCITY:\t\t\tMINNEAPOLIS\n\t\tSTATE:\t\t\tMN\n\t\tZIP:\t\t\t55403\n\t\tBUSINESS PHONE:\t\t6123046073\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\t1000 NICOLLET MALL\n\t\tCITY:\t\t\tMINNEAPOLIS\n\t\tSTATE:\t\t\tMN\n\t\tZIP:\t\t\t55403\n\n\tFORMER COMPANY:\t\n\t\tFORMER CONFORMED NAME:\tDAYTON HUDSON CORP\n\t\tDATE OF NAME CHANGE:\t19920703\n\n\tFORMER COMPANY:\t\n\t\tFORMER CONFORMED NAME:\tDAYTON CORP\n\t\tDATE OF NAME CHANGE:\t19690728\n\n\n10-K\n1\na2190597z10-k.htm\nFORM 10-K\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTable of Contents \n\n \n \n\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION Washington, D.C. 
20549 \n\n\n\n\n\n\n\n \n\n\n\n\n\nFORM 10-K \n\n\n\n\n\n\n\n\n\n\n\n\n(Mark One)\n \n \n\n\ný\n \n ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\n\n For the fiscal year ended January 31, 2009\n\n\n OR\n\n\no\n \n TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\n\n For the transition period\nfrom to \n \n\n\n\n\nCommission file number 1-6049 \n\n\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n \n\nTARGET CORPORATION (Exact name of registrant as specified in its charter) \n\n\n\n\n\n\n\n\n\n\nMinnesota\n(State or other jurisdiction of\nincorporation or organization)\n \n 41-0215170\n(I.R.S. Employer\nIdentification No.)\n\n\n 1000 Nicollet Mall, Minneapolis, Minnesota\n(Address of principal executive offices)\n \n 55403\n(Zip Code)\n\n\n\n\nRegistrant's telephone number, including area code: 612/304-6073 \n\nSecurities\nRegistered Pursuant To Section 12(B) Of The Act: \n\n\n\n\n\n\n\n\n\n\nTitle of Each Class \n \nName of Each Exchange on Which Registered \n\n\nCommon Stock, par value $.0833 per share\n \nNew York Stock Exchange\n\n\n\n\nSecurities registered pursuant to Section 12(g) of the Act: None \n\n\n\n\n\n\n\n \n\n\n\n \n\nIndicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.\nYes ý No o \n\nIndicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act.\nYes o No ý \n\nNote – Checking the box above will not relieve any registrant required to file reports pursuant to\nSection 13 or 15(d) of the Exchange Act from their obligations under those Sections. 
\n\nIndicate\nby check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding\n12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days.\nYes ý No o \n\nIndicate by check mark if disclosure of delinquent filers pursuant to Item 405 of Regulation S-K (§229.405 of this chapter) is not\ncontained herein, and will not be contained, to the best of registrant's knowledge, in definitive proxy or information statements incorporated by reference in Part III of this\nForm 10-K or any amendment to this Form 10-K. ý \n\nIndicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer or a smaller reporting company (as defined\nin Rule 12b-2 of the Act). \n\nLarge\naccelerated filer ý Accelerated filer o Non-accelerated filer\no Smaller reporting company o \n\n\nIndicate by "
```
The text is now more readable. We still have some other tags in the text; for example, "&nbsp;" is a non-breaking space. When we output the text, we also see formatting characters like "\n" for newlines, "\t" for tabs, and "\r" for carriage returns. If we only wanted raw text, we could continue cleaning some of these as well. On the other hand, there may be times when there is information in the HTML tags that we want to use (for example, when getting the title of a section, a table, etc.; see the supplemental section at the end of the chapter).
We can apply this approach to clean up the text to get close to raw text. Let's remove numbers, punctuation, other tags, formatting characters, and multiple spaces. str_replace_all finds patterns and replaces them with any string we specify. Here we will replace matches with an empty string, and then collapse any remaining runs of whitespace into a single space.
“`r
txtstr3 <- str_replace_all(txtstr2, “ |\\d|[:punct:]|\\n|\\t|\\r”, “”)
txtstr3 <- str_replace_all(txtstr3, “\\s\\s+”, ” “)
str_extract(txtstr3, regex(“.{5000}”, ignore_case = TRUE, dotall=TRUE))
[1] “BEGIN PRIVACYENHANCED MESSAGEProcType MICCLEAROriginatorName webmasterwwwsecgovOriginatorKeyAsymmetric MFgwCgYEVQgBAQICAfDSgAwRwJAWsNKKAVtBzYZmraGjlWyKXmZvdTINen TWSMvrzLADbmYQaionwgsDWPoaMDtdezXMmzT+B+twIDAQABMICInfo RSAMDRSA LCdKnwZJdHNNCRDDayMihLBecZFrsxAzyrWHSgckwMNwJOxASeAgAxr KOszBXpWPwAthLvtA==txt hdrsgml ACCESSION NUMBERCONFORMED SUBMISSION TYPEKPUBLIC DOCUMENT COUNTCONFORMED PERIOD OF REPORTFILED AS OF DATEDATE AS OF CHANGEFILERCOMPANY DATACOMPANY CONFORMED NAMETARGET CORPCENTRAL INDEX KEYSTANDARD INDUSTRIAL CLASSIFICATIONRETAILVARIETY STORES IRS NUMBERSTATE OF INCORPORATIONMNFISCAL YEAR ENDFILING VALUESFORM TYPEKSEC ACT ActSEC FILE NUMBERFILM NUMBERBUSINESS ADDRESSSTREET NICOLLET MALLCITYMINNEAPOLISSTATEMNZIPBUSINESS PHONEMAIL ADDRESSSTREET NICOLLET MALLCITYMINNEAPOLISSTATEMNZIPFORMER COMPANYFORMER CONFORMED NAMEDAYTON HUDSON CORPDATE OF NAME CHANGEFORMER COMPANYFORMER CONFORMED NAMEDAYTON CORPDATE OF NAME CHANGEKazkhtmFORM KTable of Contents UNITED STATESSECURITIES AND EXCHANGE COMMISSION Washington DC FORM K Mark One ANNUAL REPORT PURSUANT TO SECTION OR d OF THE SECURITIES EXCHANGE ACT OF For the fiscal year ended January OR TRANSITION REPORT PURSUANT TO SECTION OR d OF THE SECURITIES EXCHANGE ACT OF For the transition periodfromtoCommission file number TARGET CORPORATION Exact name of registrant as specified in its charter MinnesotaState or other jurisdiction ofincorporation or organization IRS EmployerIdentification No Nicollet Mall Minneapolis MinnesotaAddress of principal executive offices Zip CodeRegistrants telephone number including area code SecuritiesRegistered Pursuant To SectionB Of The Act Title of Each Class Name of Each Exchange on Which Registered Common Stock par value $ per shareNew York Stock ExchangeSecurities registered pursuant to Sectiong of the Act None Indicate by check mark if the registrant is a wellknown seasoned issuer as defined in Rule of the Securities ActYesNo Indicate by check mark if the registrant is not required 
to file reports pursuant to Section or Sectiond of the ActYesNo Note Checking the box above will not relieve any registrant required to file reports pursuant toSection or d of the Exchange Act from their obligations under those Sections Indicateby check mark whether the registrant has filed all reports required to be filed by Section or d of the Securities Exchange Act of during the precedingmonths or for such shorter period that the registrant was required to file such reports and has been subject to such filing requirements for the past daysYesNo Indicate by check mark if disclosure of delinquent filers pursuant to Item of RegulationSK sect of this chapter is notcontained herein and will not be contained to the best of registrants knowledge in definitive proxy or information statements incorporated by reference in PartIII of thisFormK or any amendment to this FormK Indicate by check mark whether the registrant is a large accelerated filer an accelerated filer a nonaccelerated filer or a smaller reporting company as definedin Ruleb of the Act Largeaccelerated filer Accelerated filer Nonaccelerated filerSmaller reporting company Indicate by check mark whether the registrant is a shell company as defined in Ruleb of the ActYesNo Aggregate market value of the voting stock held by nonaffiliates of the registrant on August was $ based on the closing price of$ per share of Common Stock as reported on the New York Stock Exchange Composite Index Indicatethe number of shares outstanding of each of registrants classes of Common Stock as of the latest practicable date Total shares of Common Stock par value $ outstanding atMarch were DOCUMENTS INCORPORATED BY REFERENCE Portions of Targets Proxy Statement to be filed on or about April are incorporated into PartIII <ANAME=pagebg> Table of Contents <ANAME=BGAmaintoc> TABLE OF CONTENTS PART IItem Business ItemA Risk Factors ItemB Unresolved Staff Comments Item Properties Item Legal Proceedings Item Submission of Matters to a Vote 
of Security Holders ItemA Executive Officers PART IIItem Market for Registrants Common Equity Related Stockholder Matters and Issuer Purchases of Equity Securities Item Selected Financial Data Item Managements Discussion and Analysis of Financial Condition and Results of Operations ItemA Quantitative and Qualitative Disclosures About Market Risk Item Financial Statements and Supplementary Data Item Changes in and Disagreements with Accountants on Accounting and Financial Disclosure ItemA Controls and Procedures ItemB Other Information PART IIIItem Directors Executive Officers and Corporate Governance Item Executive Compensation Item Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters Item Certain Relationships and Related Transactions and Director Independence Item Principal Accountant Fees and Services PART IVItem Exhibits and Financial Statement Schedules Signatures ScheduleII Valuation and Qualifying Accounts Exhibit Index Exhibit Computations of”
```
Now that we have something closer to plain text, let’s learn some other tools.
### Counting
We can count the number of times a pattern is found in a string. We can use the str_count function to do this.
```r
str_count(txtstr3, regex("Target", ignore_case = TRUE, dotall = TRUE))
[1] 349
```
We can make this more general by matching a pattern. Suppose we want to count the number of words. We will use word characters with a leading and trailing space.
```r
str_count(txtstr3, regex(" \\w+ ", ignore_case = TRUE, dotall = TRUE))
[1] 52236
```
We might recognize that this count is somewhat off because the text is not perfectly clean and the pattern requires a space on each side of a word, but it is reasonably close.
Let’s add one more piece. Let’s suppose we want to extract “Target” only if it is preceded by “Super”. We can use what are called lookarounds. Lookarounds check the text before or after the cursor’s current position. Importantly, with a lookaround, the cursor does not move. Let’s try this here.
```r
str_count(txtstr3, regex("(?<=Super)Target", ignore_case = TRUE, dotall = TRUE))
[1] 7
```
Here we get 7 instances where “Target” is found in the text preceded by “Super”.
Let’s walk through how this works. The “(?<=Super)” is a positive look-behind. Picture the cursor moving through the string “SuperTarget”:
* “Super|Target” — the cursor sits between “Super” and “Target”.
At this position, the engine evaluates the look-behind, checking whether the text immediately behind the cursor matches “Super”. If it does, matching continues forward with “Target”. The look-behind is not consumed as part of the match; it only qualifies when the “Target” pattern is matched. We can see this by extracting the text.
```r
str_extract(txtstr3, regex("(?<=Super)Target", ignore_case = TRUE, dotall = TRUE))
[1] "Target"
```
### Extracting parts of text
Now suppose that you are asked to search through hundreds or thousands of documents to find a specific word or phrase and then to read what is around that word or phrase. The “find” function in many text editors can do this. These functions are also based on regular expression tools. However, what if you have to look for different words or phrases, or you have to look for specific patterns around the word or phrase? This is where regular expressions can be very helpful. Let’s say that you want to look for any instance of the word “risk” or the word “uncertain” and then read the surrounding words. We could capture this with a regular expression pattern that looks for the word “risk” or “uncertain” and then captures the surrounding text.
```r
tmp <- str_extract_all(txtstr3, regex(".{250}(risk|uncertain).{250}", ignore_case = TRUE, dotall = TRUE))[[1]]
length(tmp)
[1] 39
tmp[[2]]
[1] "t Guide Corporate Responsibility Report and the position descriptions for our Board of Directors and Board committees are also available free of charge in print upon request or atwwwTargetcom click on Investors and Corporate Governance <ANAME=daitemariskfactors> ItemARisk Factors Our business is subject to a variety of risks The most important of these is our ability to remain relevant to ourguests and a brand they trust Meeting our guests expectations requires us to manage various operational and f"
```
Here we have a list of 39 instances where risk or uncertainty are used in the document. We also have 250 characters before and after these words. We can see the second instance talks about the most important risk. We could put the list in different formats to fit our needs. For example, we could put these in a data frame.
```r
tmp2 <- data.frame(text = tmp)
```
We could further qualify the search by looking for some other patterns. For example, let’s say we want to look for only where the risks or uncertainty related to customers (Target refers to these as guests).
```r
tmp <- str_extract_all(txtstr3, regex(".{250}((?<=guest.{0,250}?)|(?=.{0,250}?guest))(risk|uncertain).{250}", ignore_case = TRUE, dotall = TRUE))[[1]]
length(tmp)
[1] 3
tmp[[2]]
[1] "ur merchandise offerings including food drug and childrens products do not meet applicable safety standards or our guestsexpectations regarding safety we could experience lost sales experience increased costs and be exposed to legal and reputational risk All of our vendors must comply with applicable product safetylaws and we are dependent on them to ensure that the products we buy comply with all safety standards Events that give rise to actual potential or perceived product safety concerns includi"
```
Note that this step takes quite a bit longer because the engine has to perform backward and forward lookarounds at each point. Finding “guest” either before or after the match is slightly complicated because the alternation has to be evaluated from the same cursor point. The first part, “(?<=guest.{0,250}?)”, says find “guest” before the cursor point and allow anywhere between zero and 250 characters between “guest” and the cursor point. The second part, “(?=.{0,250}?guest)”, says match anything between zero and 250 characters after the cursor point (this will include risk or uncertain) followed by “guest”. The | says match the look-behind OR the look-ahead. The placement of parentheses is also important for what is evaluated together.
Most importantly, what might take minutes to do manually can be done in seconds with a regular expression. However, understanding regular expressions and how to create them is necessary.
### Extracting specific items
Let’s take another case in which we are trying to record the dates from physician’s notes. See what some of the notes look like here or by viewing the data frame.
```r
notes <- data.frame(txt = readLines("PatientDatesNotes.txt"))
notes$txt[[1]]
[1] "03/25/93 Total time of visit (in minutes):"
notes$txt[[161]]
[1] "see 21 Oct 2007 Schroder Hospital discharge summaryViolent Behavior Hx of Violent Behavior: Yes"
notes$txt[[316]]
[1] "rBrookhaven outpatient program in Jun 1976- lost 40 lbs, couldn't get out of bed; had been seeing a therapist at the time"
notes$txt[[477]]
[1] "1989 Family Psych History: Family History of Suicidal Behavior: Ideation/Threat(s)"
```
The problem is that the dates are not consistent in all notes. Perhaps with 500 records, this could be done manually. Now imagine that this is 500 every day or thousands of records. Finding a way to automatically do this (other than making the physician manually input the dates in a consistent format in the future) would save a lot of time.
Let’s walk through steps for doing this.
First, let’s start with the first pattern that will be easiest to match. Let’s look for a date in the format “mm/dd/yyyy”.
```r
pttrn1 <- regex("\\d{1,2}/\\d{1,2}/\\d{4}", ignore_case = TRUE, dotall = TRUE)
notes <- notes %>%
  mutate(
    dtmatched = case_when(
      str_detect(txt, pttrn1) ~ str_extract(txt, pttrn1),
      TRUE ~ NA
    )
  )
notes %>% summarize(sum(!is.na(dtmatched)))
sum(!is.na(dtmatched))
1 25
```
We see that only 25 of 500 notes have this format. The code above sets the stage for the next steps. case_when is a conditional function that assigns a value based on different conditions. Here it says if the condition is true that the pattern is found in the text string, then extract the pattern. If the condition is false, then move on to the next condition. The only remaining condition (TRUE) means that if it is true that no other condition has been met, then set to NA. We can add new conditions to case_when to look for other patterns. Let’s make the first pattern a little more general.
```r
pttrn1 <- regex("\\d{1,2}[[:punct:]]\\d{1,2}[[:punct:]](\\d{2}|\\d{4})", ignore_case = TRUE, dotall = TRUE)
notes <- notes %>%
  mutate(
    dtmatched = case_when(
      str_detect(txt, pttrn1) ~ str_extract(txt, pttrn1),
      TRUE ~ NA
    )
  )
notes %>% summarize(sum(!is.na(dtmatched)))
sum(!is.na(dtmatched))
1 125
```
Now we are up to 125. Let’s add a different pattern.
```r
pttrn1 <- regex("\\d{1,2}[[:punct:]]\\d{1,2}[[:punct:]](\\d{2}|\\d{4})", ignore_case = TRUE, dotall = TRUE)
pttrn2 <- regex("\\w+ \\d{1,2}, \\d{4}", ignore_case = TRUE, dotall = TRUE)
notes <- notes %>%
  mutate(
    dtmatched = case_when(
      str_detect(txt, pttrn1) ~ str_extract(txt, pttrn1),
      str_detect(txt, pttrn2) ~ str_extract(txt, pttrn2),
      TRUE ~ NA
    )
  )
notes %>% summarize(sum(!is.na(dtmatched)))
sum(!is.na(dtmatched))
1 151
```
Now we are up to 151. But now we have introduced a bad match:
```r
notes$txt[[462]]
[1] ". Age 16, 1991, frontal impact. out for two weeks from sports."
```
We could try to restrict the matching a little more.
```r
pttrn1 <- regex("\\d{1,2}[[:punct:]]\\d{1,2}[[:punct:]](\\d{2}|\\d{4})", ignore_case = TRUE, dotall = TRUE)
pttrn2 <- regex("(Ja|Fe|Ma|Ap|Ma|Ju|Au|Se|Oc|No|De)\\w+ \\d{1,2}, \\d{4}", ignore_case = TRUE, dotall = TRUE)
notes <- notes %>%
  mutate(
    dtmatched = case_when(
      str_detect(txt, pttrn1) ~ str_extract(txt, pttrn1),
      str_detect(txt, pttrn2) ~ str_extract(txt, pttrn2),
      TRUE ~ NA
    )
  )
notes %>% summarize(sum(!is.na(dtmatched)))
sum(!is.na(dtmatched))
1 150
```
Now we have eliminated the bad match.
We could continue to build our matches until we have a set we are happy with. We may still require occasional manual intervention; however, we could eliminate a large part of the manual work.
Once we have a set of matched dates, we can convert them to proper date objects with the lubridate package. If we can get the matches into a consistent component order, lubridate can convert them to dates. Lubridate has functions that convert dates based on the order of their components (mdy, ymd, dmy, ym, my). So far our matches follow a month-day-year order, so we can use mdy. Let’s try an example below.
```r
conv <- notes %>%
  filter(!is.na(dtmatched)) %>%
  mutate(
    dtconv = mdy(dtmatched)
  )
conv[c(1,134,135),]
txt dtmatched dtconv
1 03/25/93 Total time of visit (in minutes): 03/25/93 1993-03-25
134 .Got back to U.S. Jan 27, 1983. Jan 27, 1983 1983-01-27
135 September 01, 2012 Age: September 01, 2012 2012-09-01
```
Here we can see that even though the matched strings were in different formats, they have all been converted to a consistent date format.
## Text as the primary data source – purpose
Sometimes we may want to capture specific information that text contains or otherwise summarize information from text. There are a broad set of tools developed to capture information from text. These are generally referred to as natural language processing (NLP) tools. These tools can be used to count the frequency of words or phrases in a text, to identify the sentiment of a text, to classify a text into categories, to predict outcomes, or to generate new text based on patterns in the text data. These tools are foundational for many of the text based applications that are used today such as spam filters, recommendation systems, chatbots, and more. These tools have evolved rapidly in recent years with the development of new models (transformer models) and the application of extremely large datasets and large amounts of computing power.
In this section, we will describe some of the foundational building blocks of NLP tools. We will then introduce simple applications of modern models.
### Foundations
The basic tools of language models are the same as those of other machine learning models. However, some applications have led to new or improved techniques specialized to these tasks. When working with text, language must be represented the same way as data in other applications: as numerical objects arranged in rows of observations and columns of features.
We will work through the building blocks to build an understanding of how these tools work. However, complete coverage of these topics cannot be done in a single chapter. We will therefore introduce the basics and then try some applications rather than going through all of the details or connecting all of the pieces.
#### Tokens
The first step in working with text is to break the text into pieces called tokens (tokenization). Once we have individual tokens, we can count their frequencies, find how they are related to other tokens, and combine them into numbers that capture various aspects of textual meaning. Various approaches have been taken to creating tokens. Some of these choices have been prompted by the large amount of computing memory and power required to represent each piece of text as a separate number in the data. Here we will discuss standard approaches.
The first standard approach is to break text into individual words. Regular expressions are used to break the text in this way, usually by splitting text at every space or punctuation mark. Some tokens are unlikely to contain meaning, so the next step may be to remove uninformative text. This might include removing unusual punctuation, tags, spaces, or other items. Other tokens that are unlikely to contain information, called “stop words”, may also be removed. In some cases, words may be reduced to their core meaning by removing endings such as “ed”, “s”, and “ing” – called stemming. Related to stemming, words may be reduced to their lexical meaning, for example, “are”, “is”, and “am” might be reduced to “be” – called lemmatization.
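The steps above can be sketched in base R. The sentence and the stop-word list below are made up for illustration; a real pipeline would use a package such as udpipe and a full stop-word list.

```r
# A minimal tokenization sketch: split on spaces and punctuation,
# lowercase everything, and drop a tiny illustrative stop-word list.
txt <- "The risks are increasing, and the risks remain uncertain."
tokens <- unlist(strsplit(tolower(txt), "[[:space:][:punct:]]+"))
tokens <- tokens[tokens != ""]          # drop empty splits
stop_words <- c("the", "and", "are")    # illustrative list only
tokens <- tokens[!tokens %in% stop_words]
tokens
# five tokens remain: risks, increasing, risks, remain, uncertain
```

With the text reduced to tokens, we could now count frequencies or feed the tokens into the embedding steps described next.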
#### Embeddings
Earlier versions of NLP relied on counts of tokens to represent text meaning. However, word counts only roughly capture textual meaning. The next step is to capture the meaning of a token by representing how individual tokens are related to other tokens. Several approaches were developed to represent meaning. These included n-grams (combining two or more tokens into a single new token) and collocation (counting how often words occur near one another). These methods for capturing the meaning of texts have developed into vector representations called embeddings. Embeddings are numerical representations of how an individual token is related to other tokens. Embeddings are based on a dictionary that lists all tokens that the embeddings will include. A typical simplified example is provided below. The table header shows the dimensions tracked by the dictionary.
| Token     | Male | Leader | Royalty |
|-----------|------|--------|---------|
| King      | 1    | 1      | 1       |
| Queen     | 0    | 1      | 1       |
| President | 0.5  | 1      | 0       |
The numbers in the table represent how related the token is to each dimension of the embedding vector. In some cases, these are probabilities or frequencies for how often the words coincide in the same text. In this example, “King” is represented as “Male, Leader, Royalty”. “Queen” shares some attributes with “King” but also differs; here, the difference is the numerical representation of not being “Male”. “President” is only partially associated with “Male”, is not “Royalty”, but is a “Leader”. Embeddings are typically much longer vectors, but they capture meaning in how words relate to one another. The embeddings for this example would be (1,1,1) for “King”, (0,1,1) for “Queen”, and (0.5,1,0) for “President”. These vectors can then be used numerically in much the same way as quantitative features in machine learning models. One of the many advantages of embeddings is that similar words can be treated in similar ways. For example, “King” and “Queen” might be treated as more closely related than “Lord” and “Peasant”.
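We can compute these comparisons directly. The sketch below uses the toy embeddings from the table above; cosine similarity (the cosine of the angle between two vectors) is a standard way to measure how similar two embeddings are.

```r
# Toy embeddings from the table above: (Male, Leader, Royalty)
king      <- c(1, 1, 1)
queen     <- c(0, 1, 1)
president <- c(0.5, 1, 0)

# Cosine similarity: dot product scaled by the two vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(king, queen)       # about 0.82
cosine(king, president)   # about 0.77, so King is closer to Queen
```

The same function works unchanged on the much longer embedding vectors produced by real models.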
Once a token has an embedding, this same approach can be applied to sentences, paragraphs, or documents. These are typically referred to as document embeddings. Document embeddings are some combination of the embeddings from the tokens in the document, for example, they might be the average of the token embeddings.
These embeddings are used for many tasks. For example, embeddings can be used to find similar or dissimilar words or documents. They might also be used to find topics and keywords. They might be used to search documents without the need to be precise about the specific terms or spellings of search terms.
#### Tasks
Text prepared to work in machine learning models has been used in various ways. Here we will describe a few of them. We will not describe the specific machine learning methods that make these possible. As in prior chapters, these models can be unsupervised or supervised and typically rely on variations of neural network models.
##### Document classification
One use of NLP models is to classify a document so that it can be put into a group or interpreted in a simple way. For example, the document might be classified as a legal document or a sales document, or it might be given a score such as a sentiment score, an uncertainty score, or some other score. For example, a company may want to predict whether a written review represents a 1-star or a 5-star review.
These uses of NLP require a data set to train the model that has text that has a label, for example, positive or negative sentiment. The model can then be trained with the label as the Y variable (supervised model).
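A hedged sketch of this idea is below. The embedding columns and labels are fabricated for illustration; in practice the features would come from document embeddings like those described above.

```r
# Sketch: classify documents from fabricated embedding features.
set.seed(1)
dset <- data.frame(
  rating = factor(rep(c("negative", "positive"), each = 10)),
  e1 = c(rnorm(10, -1), rnorm(10, 1)),  # fabricated embedding columns
  e2 = rnorm(20),
  e3 = rnorm(20)
)

# A supervised model with the label as Y and the embeddings as features
mdl <- glm(rating ~ e1 + e2 + e3, data = dset, family = binomial)
head(predict(mdl, type = "response"), 3)  # predicted probabilities
```

With real data, the embedding columns would number in the hundreds, but the modeling step is the same.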
##### Summarization
Another use of NLP is to summarize text. Summarization can find topics or key sentences that best represent the text (unsupervised methods). Summarization could also be done by supervised models that are used to generate text (see below).
##### Text generation and large language models (LLMs)
An important set of NLP models that has dramatically altered NLP is generative text models. These are models that use a large text corpus to predict tokens given a set of input tokens. They are trained by iteratively predicting new words in text; in this way, the new word to be predicted acts as the y variable. Therefore, given a set of input tokens, the model can predict the most probable next word. The biggest innovation has been to use tremendously large training data sets of text, for example, novels, Twitter messages, and webpages – billions of documents. These models also have extremely large dictionaries/embeddings and potentially billions of parameters.
These large language models (LLMs) have changed how most users interact with text documents. Many of the tasks done with other tools previously are mostly now done with these LLMs. However, because of the large memory and computing requirements, “small language models” may still be necessary, for example, for use on cell phones or on less complex tasks.
The usefulness of LLMs comes from their ability to encode a large body of text that can then be called on to generate predicted text. These tasks seem most successful for text that is predictable. However, because the models predict the most likely next word based on input words, they do not do the same thing as searching for and returning known instances of similar documents. The probabilistic nature of the generated text also means that the models may not act in desirable ways. For example, LLMs may “hallucinate” by generating text that seems authoritative or factual even though it only exists in the “imagination” of the LLM. For instance, LLMs have generated citations and summaries of articles that are not real publications. Some model advancements improve on the failures of earlier models, but these are ongoing projects that are being developed and revised continuously.
Below, we will discuss some of the current uses of LLMs. There are various tools for interacting with LLMs. In class, we will use one that we have access to through the university (you can log in to Microsoft Copilot with your tamu account). However, you may use other LLMs that you may have access to.
##### Uses of LLMs
Many of the same tasks that have been done with NLP tools seem to be reasonably successful with LLMs. The primary way of interacting with an LLM is causing it to generate text based on input text. The input text is called a “prompt”. The emergence of LLMs has led to an entirely new set of skills called prompt engineering.
##### Prompt engineering
Prompt engineering is a set of strategies for causing generative text models to generate text in a way that helps the user. It is important to recognize that prompt engineering is not really about asking a model to return what we might want. There is no conscious cognition behind the model being queried. Prompt engineering is choosing combinations of words that can generate words that are the most likely consequence of the prompt text.
Despite the statistical nature of LLMs, successful approaches to prompting reflect seemingly logical steps. For example, being specific, providing context, and giving examples seem to make prompting more successful. Below, we will explore some applications of LLMs and do so in ways that try to generate successful text responses.
##### Chat bots
Chat bots are perhaps the most visible instance of LLMs. Chat bots are models that predict the next word given input text. The input text begins with the user’s prompt and the model predicts the most logical next word. After generating the next word, this word becomes part of the new input to the model. The model then again predicts the next word and so forth.
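The loop described above can be sketched with a toy example. The lookup table below stands in for a real trained model, which would instead predict from probabilities over a very large vocabulary; the words here are made up for illustration.

```r
# A toy "next word" table standing in for a trained language model
next_word <- list(
  "the"      = "model",
  "model"    = "predicts",
  "predicts" = "words",
  "words"    = "."
)

prompt  <- "the"
out     <- prompt
current <- prompt
for (i in 1:4) {
  nw      <- next_word[[current]]  # predict the next word
  out     <- paste(out, nw)        # the generated word joins the input
  current <- nw
}
out
# "the model predicts words ."
```

A real chat bot follows the same generate-append-repeat loop, stopping when a special end-of-text token is predicted.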
Chat bots predict words based on the training data that has been used to create the model. Some chat bots are trained on textual data that have a particular objective. For example, some models have been trained on legal documents or literary text. Some of the “general” purpose chat bots have been trained on seemingly every conceivable source of text.
Businesses might train a model on their internal documents such as training material, customer service logs, technical material, frequently asked questions, or other documents. By training on their internal documents, the chat bot can be developed to respond with the most likely words within the context of these documents. For example, accounting firms have developed internal chat bots to help employees with routine questions and training.
Test a chat bot by using the following prompt.
“You are an accounting data analytics tutor. Explain in a step-by-step manner how an accountant could use a trial balance to detect fraudulent transactions.”
Using the above prompt, you might notice that the generated text provides some general responses that seem reasonable. However, you might also notice that some statements sound as if they are definitive treatments of the matter. Perhaps prompting in a different way can alter the responses slightly.
“You are an accounting data analytics tutor. Explain in a step-by-step manner how an accountant could use a trial balance to detect fraudulent transactions. Explain challenges that an accountant might encounter when performing the step.”
##### Retrieval augmented generation
LLMs, despite being large in their parameters, have limitations on the number of tokens they use to generate the next word. This is because the models have to have a fixed limit to the number of tokens used to make the next word prediction. An example of this problem might be if a user would like to prompt the model to respond to a prompt that includes a full legal contract, novel, or textbook. Without training a model specific to this document, the model cannot take the full document in as prompting text.
One way to circumvent these limits on text models is called retrieval augmented generation (RAG). This approach combines a generative model with a searching algorithm to find similar tokens within a large document. An example of the process is described below.
First, a user enters a prompt. Second, the RAG pipeline turns the prompt into a document embedding. Third, the prompt embedding is compared with embeddings from pieces of the large document to find the pieces that are most similar to it. Last, the generative model uses the prompt along with the retrieved pieces of the large document to generate a text response.
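The retrieval step can be sketched as follows. The embeddings here are made up for illustration; a real pipeline would embed the prompt and the document chunks with the same embedding model.

```r
# Cosine similarity between a prompt embedding and each chunk embedding
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

prompt_emb <- c(0.9, 0.1, 0.2)        # fabricated prompt embedding
chunk_embs <- rbind(                  # fabricated chunk embeddings
  chunk1 = c(0.10, 0.80, 0.30),
  chunk2 = c(0.85, 0.20, 0.10),
  chunk3 = c(0.20, 0.20, 0.90)
)

sims <- apply(chunk_embs, 1, cosine, b = prompt_emb)
names(which.max(sims))   # the chunk handed to the generative model
# "chunk2"
```

In practice, the top several chunks are retrieved and appended to the prompt before generation.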
Various versions of user interfaces create the RAG process by allowing users to add a file to the prompt. Test a RAG model by doing the following.
Input the following into a chatbot:
“What is OWL and what happened in 2018 that increased ATVI’s investment value?”
Depending on the documents that the model was trained on, the generated response may not be helpful.
Now, load an [analyst’s report available here](https://www.dropbox.com/scl/fi/kkxp7fbhu93b4deltdgcf/AnalystReport10.pdf?rlkey=51ng6y3ms1vqozlgibnolq7i4&dl=0) into a chatbot. Use the following prompt to generate a response based on the document.
“What is OWL and what happened in 2018 that increased ATVI’s investment value?”
##### Extraction
The approaches above can be used to generate text that amounts to extracting desired text from a document. Use the analysts’ report above to extract the price target and stock rating.
Prompt: “What were ATVI’s stock rating and price target on January 10th, 2018? Provide the response in the following format. [Stock rating: ; Price Target: ]”
This approach doesn’t always work, but if text to generate is fairly structured within the document, this can be successful.
##### Summarizing
The same approach can be applied to summarizing text. Using the RAG above, try the following.
Prompt: “Summarize Morgan Stanley’s analyst report for ATVI.”
If the document is not too large, the same approach can be done within the chatbot’s prompt limits. For example, try the following.
Prompt: ”
You are a journalist that creates a short summary of news events. Below is a news event. Summarize the story in two short paragraphs.
What Happened? Today, Activision Blizzard announced a two-year deal with
Twitch (owned by Amazon), to stream the Overwatch League games (season
starts January 10th). Fans will be able to watch all regular season and playoff
games on Twitch. Twitch, the world’s largest website for streaming video games,
last reported 10mn DAUs in September of 2016 (with 106 minutes of daily
engagement). While the terms of the deal were not disclosed, Sports Business
Daily is reporting that Twitch is paying ATVI $90mn over 2 years for the rights
to stream OWL. We note that this deal is approximately equal to the deal Riot
Games (League of Legends publisher) made with BAMTech on a per year basis (7
years, $300mn). Sports Business Daily is also reporting that while Twitch will be
streaming all regular season and playoff games, ATVI’s Major League Gaming will
also stream half of the games. We view this positively as it gives ATVI the ability
to generate additional advertising revenues, and also potentially build MLG into
the ESPN of eSports (which we see as the bull case as we lay out in eSports: an
$8bn Call Option?). As discussed in the note, we expect the Overwatch League to
generate $100mn of gross revenue per year, with digital streaming
rights/advertising being the largest contributor with $32mn/year. We note that
the $45mn/year reported by Sports Business Daily is better than our base case.
While we are encouraged by the early results, we do not expect any material EPS
contribution from the Overwatch League as we expect ATVI to invest early
revenue and monetization into growing the success of the league.”
##### Agent models
One of the current efforts in using LLMs is developing tools that LLMs can interact with to perform work that might otherwise be tedious or technical. These agent models are trained to generate a specific, limited number of actions. They work much like RAG models that search a database of possible actions and then find the closest matching action. For example, a model could be created to generate and run R code. To make an agent model work in this way, the actions that the agent can take have to be explicitly specified. This might include providing options for sorting, summarizing, filtering, and mutating (i.e., dplyr functions). The model then has to take a user’s prompt, find the action that most closely matches the prompt, and then initiate the action that has been specified.
Some of this can currently be done in part by building code with prompting. For example, [see this story](https://medium.com/firebird-technologies/building-auto-analyst-a-data-analytics-ai-agentic-system-3ac2573dcaf0). However, generative models cannot run the code themselves in most instances. Agent models move closer to the point of only requiring a chat environment to do data analysis. For examples, [see a company working on this here](https://numbersstation.ai/introducing-meadow-llm-agents-for-data-tasks/), or [here](https://techcrunch.com/2024/10/24/anthropics-ai-can-now-run-and-write-code/). However, because the tasks that can be performed must be specified, these tools are still in their infancy. At some future point, you should be able to chat with a model to do everything we have done in this course!
## Review
In this chapter, we have surveyed a wide variety of tools for analyzing textual data. These include tools for working with individual strings with regular expressions, counting words, and identifying patterns. We have also introduced tools for natural language processing. These tools can be used to generate text to extract items, summarize documents, and search documents.
### Conceptual questions
1. Describe how regular expressions work.
2. Describe how regular expressions can be used to clean textual data, search text, and extract text.
3. Describe how word embeddings capture similarities between words.
4. Describe the conceptual process for building large language generation models.
5. List tasks that can be done with LLMs.
6. Explain limitations to using LLMs.
7. Explain the difference between a generative model and a retrieval augmented generation model.
### Practice questions
1. Using the 10-K document, count the number of times the word “strategy” is used.
2. Using the 10-K document, extract 50 characters around the first instance of the word “strategy”.
3. Using the 10-K document, extract the first instance of the word “strategy” and four words following “strategy” where “strategy” is preceded by a word with the root “finan”.
4. Using the physician notes data, extract dates that only have the month and the year where the month is a number, a word, or an abbreviation and the year is a four digit year.
5. Experiment with extracting items from a document, summarizing a document, and generating text from a document using a chatbot or RAG model.
## Solutions to practice questions
1. The following code could be used to do this.
```r
str_count(txtstr, regex("strategy", ignore_case = FALSE, dotall = TRUE))
```
2. The following code could be used to do this.
```r
str_extract(txtstr, regex(".{50}strategy.{50}", ignore_case = FALSE, dotall = TRUE))
```
3. The following code could be used to do this.
```r
str_extract(txtstr, regex("(?<=finan\\w{1,10} )strategy( \\w+){4}", ignore_case = FALSE, dotall = TRUE))
```
4. The following code could be used to do this.
```r
pttrn <- regex("(\\d{1,2}|Ja\\w+|Fe\\w+|Ma\\w+|Ap\\w+|Ma\\w+|Ju\\w+|Au\\w+|Se\\w+|Oc\\w+|No\\w+|De\\w+)(\\W+)\\d{4}", ignore_case = TRUE, dotall = TRUE)
str_extract(notes$txt, pttrn)
```
Tutorial video
Note: include practice and knowledge checks
Mini-case video
Note: include practice and knowledge checks