"

13 Unsupervised models: clustering and anomaly detection

Learning Objectives

  • Explain principles of unsupervised machine learning
  • Describe clustering and anomaly detection techniques including strengths and weaknesses
  • Describe accounting applications
  • Use R packages for clustering and anomaly detection

Chapter content

Unsupervised models are models that learn patterns from data without having a target variable that is to be predicted. Rather than trying to find a model that best predicts an outcome, unsupervised models have objectives related to explaining variation in the data.

This chapter introduces two types of unsupervised models: clustering and anomaly detection. The next chapter explores principal components analysis and feature reduction.

Clustering

Clustering is the process of grouping similar observations (rows) together. The goal of clustering is to find groups of observations that are similar to each other and different from other observations. Clustering methods require a metric for how similar a row in the data set is to other rows in the data.

Distance metrics

Similarity between observations is measured using the columns/features in the data, which are combined into a distance metric.

One distance metric, Euclidean distance, is shown below.

\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

where a_i and b_i are the values of the ith feature in observations a and b, respectively.

A simple example with three observations for calculating Euclidean distance is shown below.

Observation   Feature 1   Feature 2   Feature 3
A             1           2           3
B             2           3           4
C             4           6           5

The distance between observations A and B is calculated as follows:

 \sqrt{(1-2)^2 + (2-3)^2 + (3-4)^2} = \sqrt{1 + 1 + 1} = \sqrt{3}

The distance between observations A and C is calculated as follows:

 \sqrt{(1-4)^2 + (2-6)^2 + (3-5)^2} = \sqrt{9 + 16 + 4} = \sqrt{29}

The distance between observations B and C is calculated as follows:

 \sqrt{(2-4)^2 + (3-6)^2 + (4-5)^2} = \sqrt{4 + 9 + 1} = \sqrt{14}

The features 1, 2, and 3 capture information about each observation. The distance gives a summary of how different those features are between observations. The smaller the distance, the more similar observations are. The larger the distance, the more different observations are.

In the example above, A and B are the most similar with a distance of \sqrt{3}. A and C are the most different with a distance of \sqrt{29}. B and C are in between with a distance of \sqrt{14}.
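These calculations can be reproduced in R. The following is a minimal sketch using base R's dist() function on the three observations above; the distances it returns should match \sqrt{3}, \sqrt{29}, and \sqrt{14} (approximately 1.73, 5.39, and 3.74).

# The three observations from the example above
obs <- data.frame(F1 = c(1, 2, 4),
                  F2 = c(2, 3, 6),
                  F3 = c(3, 4, 5),
                  row.names = c("A", "B", "C"))

# Pairwise Euclidean distances between the rows
dist(obs, method = "euclidean")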

k-means clustering

Clustering is the process of trying to find k groups of observations such that the observations within each group are as similar as possible. The k-means algorithm is a popular algorithm for doing this. The clustering is done by minimizing the total distance across all clusters, where the fit of an individual cluster is the sum of the distances of its members from the mean of the cluster. In the example above, if we assign A and B to a cluster, we first calculate the mean features for the cluster.

              Feature 1        Feature 2        Feature 3
Cluster mean  (1+2)/2 = 1.5    (2+3)/2 = 2.5    (3+4)/2 = 3.5

The distance between A and the mean is calculated as follows:

 d_A = \sqrt{(1-1.5)^2 + (2-2.5)^2 + (3-3.5)^2} = \sqrt{0.25 + 0.25 + 0.25} = \sqrt{0.75}

The distance between B and the mean is calculated as follows:

 d_B = \sqrt{(2-1.5)^2 + (3-2.5)^2 + (4-3.5)^2} = \sqrt{0.25 + 0.25 + 0.25} = \sqrt{0.75}

The fit for the cluster, where larger numbers indicate worse fit, is given by the sum of these distances.

 Cluster\ dist = d_A + d_B = \sqrt{0.75} + \sqrt{0.75} = 2\sqrt{0.75}

 

The k-means algorithm tries to find the best fit across all clusters. The algorithm starts by randomly assigning observations to clusters. Then it calculates the mean of each cluster and reassigns each observation to the cluster with the closest mean. The algorithm continues to iterate until the total distance across all clusters stops decreasing.
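As an illustration, the following sketch applies base R's kmeans() function to the three toy observations with k = 2. Note that kmeans() minimizes the within-cluster sum of squared distances, a close relative of the sum-of-distances fit described above; the chapter's example later uses h2o.kmeans() instead.

# The toy observations (same as in the sketch above)
obs <- data.frame(F1 = c(1, 2, 4),
                  F2 = c(2, 3, 6),
                  F3 = c(3, 4, 5),
                  row.names = c("A", "B", "C"))

set.seed(1234)
km_toy <- kmeans(obs, centers = 2)

km_toy$cluster        # cluster assignment for A, B, and C
km_toy$centers        # mean of each feature within each cluster
km_toy$tot.withinss   # total within-cluster sum of squares (lower is better)

With two clusters, A and B (the two most similar observations) should be grouped together and C should form its own cluster.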

More information about clustering is available in many places with some references at the end of the chapter.

Considerations for using clustering

Clustering can be a powerful tool; however, its success depends crucially on the data and how it is used. Three important considerations are described below.

Number of clusters

k-means clustering requires a choice for the number of clusters. When trying to find unknown groups, you usually do not know how many there are. Using too few clusters may not clearly separate the groups or may cause the groups to be dominated by one or a small number of features. Using too many clusters may make the groups too small and dominated by random variation in the data rather than by meaningful differences.

Choosing the number of clusters may also be an iterative process. You may start with a guess and then try to infer whether the groups are meaningful. You may then alter the number of clusters and re-run the clustering.
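One common heuristic for this iterative choice is an "elbow" plot: fit the clustering for several values of k and look for the point where additional clusters stop reducing the total within-cluster distance very much. The sketch below illustrates the idea with base R's kmeans() on the built-in iris measurements (standardized first); it is not part of the chapter's sales data example.

set.seed(1234)
X <- scale(iris[, 1:4])   # standardize the four numeric measurements

# Total within-cluster sum of squares for k = 1 to 8
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)

plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")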

Included features

The distance measure at the heart of clustering depends on which features are included in the calculation. Note that the features need to be numeric to be able to calculate the distance measure. If the distance calculation uses the “wrong” features, then the clustering cannot find groups determined by the omitted features, or it may find groups based on features that are not important.

Choosing which features to include may be an iterative process. You may start with all features, identify some groups, and then try to infer which features are most important for the groups. It is possible that the groups fail to provide new information about patterns in the data. You may then alter the features and re-run the clustering.

Another consideration in choosing features is the correlation of the features. If the features are highly correlated, then the distance measure will be dominated by the correlated features. In the extreme case, imagine that the distance measure includes 4 perfectly correlated features, i.e. the same feature repeated 4 times, and one uncorrelated feature. The distance measure will then count the difference for the repeated feature 4 times but the difference for the uncorrelated feature only once, so the repeated feature effectively receives 4 times the weight.

Scale of features

Distance measures aggregate differences between observations for each feature. If the features are on different scales, then the distance measure may be dominated by the feature with the largest scale. For example, if a feature is measured in the thousands and another feature is measured in the tens, the distance measure will be dominated by the feature in the thousands.

If scale is an important determinant of groups, clustering on features of different scales may be useful. When trying to separate groups by something other than scale, the features may need to be standardized so that they are on the same scale. The default approach in clustering is typically to standardize the features before clustering.
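A minimal illustration of scale dominance, using two made-up observations: Revenue differs by thousands while Margin differs by a fraction, so the raw distance is driven almost entirely by Revenue. Standardizing the columns first puts the two features on comparable footing.

x <- data.frame(Revenue = c(105000, 98000),   # measured in the thousands
                Margin  = c(0.32, 0.05))      # measured in fractions

dist(x)          # dominated by the Revenue difference
dist(scale(x))   # after standardizing, both features contribute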

Clustering for accounting data

Clustering is an exploratory technique. It is used when there may be important but unknown groups of observations in a data set. Exploratory here means looking for patterns in the data that might provide insights to be used later in the data analysis or in predictive modelling.

Unknown groups in accounting data often relate to the groups that generate the data. For example, groups of sales observations may come from different types of customers, customers with different tastes, or customers from different locations. Expenses may be generated by different types of products and services. When comparing different companies, industries or strategies may create groupings. More specific to how accounting data is generated, accounting methods and choices may create groupings.

Anomaly detection

Closely related to clustering is anomaly detection, which might be thought of as the opposite of clustering: clustering is about finding groups of observations that are similar, while anomaly detection is about finding observations that are different from the rest of the data. For example, given a set of proposed clusters, anomalies might be identified as a small number of observations that are far away from even the nearest cluster. Anomaly detection is related to identifying outliers on a single feature, but it also considers combinations of features that may make an observation unusual.

Anomalies are interesting because they may represent errors, fraud, or other unusual events, and they may be difficult to detect because they are rare and may not be easily identified. Anomaly detection is often used in fraud detection, network security, and other areas where unusual events are important. For example, one way to detect credit card fraud is to look for transactions that are unusual compared with normal transactions (note that supervised learning may also be used to detect credit card fraud).

There are different anomaly detection algorithms. h2o has a number of these available. References to some descriptions are provided at the end of the chapter. Many of the same considerations for clustering apply to anomaly detection techniques.

Anomaly detection for accounting data

There are various reasons anomaly detection can be useful for addressing accounting relevant questions or using accounting data. First, many accounting variables have skewed or extreme observations. Second, there are many instances where a small number of unknown observations are particularly important. For example, in fraud detection, we may want to identify a small number of fraudulent transactions. Third, anomalies may be important for understanding the data. For example, if a small number of observations are very different from the rest of the data, you may want to understand why these observations are different.

Example with h2o in R

The data for this chapter are simulated sales transactions. The data set can be downloaded here: https://www.dropbox.com/scl/fi/tgvjfo5zpnc3nj63y6h2v/SalesTransactions2024.csv?rlkey=m03aolytfd1cgjpq7ipfb5ve9&dl=0.

This section will apply clustering and then anomaly detection techniques with h2o to the sales transaction data. There are base R packages and other packages that can be used for clustering and anomaly detection; this chapter uses the h2o versions to stay as close as possible to the code in prior chapters. The h2o version of k-means (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html) includes functions that can automate some of the steps, for example, estimating the best k to use.

Because clustering and anomaly detection are primarily exploratory analyses, they might be used at any point in the data analysis process, starting with the raw values and including finalized, standardized data. Anomaly detection might be used to find outliers in training data before fitting a model. It might also be applied to predict anomalies in new data.

Data and environment set up

Preparing the data and environment includes loading necessary packages, setting up the h2o environment, and preparing and selecting features for analysis.

Load the packages used in this chapter.

library(h2o)        # machine learning models, including k-means and isolation forests
library(DALEX)      # model explanation tools
library(DALEXtra)   # extensions to DALEX

library(tidyverse)  # data import and manipulation
library(parallel)   # parallel processing utilities

Initialize the h2o cluster.

h2o.init(nthreads=8)

Read in the sales transactions data set. The data frame contains the following columns:

  • CUSTID: the customer ID
  • PRODID: the product ID
  • MonthEnd: the last day of the month, because invoices are sent out at the end of the month
  • Sales: the gross sales amount
  • NetSales: the sales amount after any discounts are applied
  • COGS: the cost of goods sold
  • ShipCost: the cost to ship the product (the company pays for shipping)
  • UnitsReturned: the number of units of the product returned during the month (the product could have been purchased previously)
  • AccountBalance: the amount owed at the end of the month (this cumulates as each product is purchased, so the total balance at the end of the month is the amount after the last product is purchased)
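A minimal sketch for importing the data, assuming the CSV has been downloaded from the link above and saved in the working directory as SalesTransactions2024.csv:

df <- read_csv("SalesTransactions2024.csv")

glimpse(df)   # check column names and types before preparing features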

Prepare and select features

Which features to use and how to prepare them depends on the reasons for the analysis, the types of groupings suspected in the data, and the features being considered. How to make these decisions may be learned through experience with the data set, understanding of the related business setting, or trial and error. The preparation code below uses common steps; they are described here but should not be treated as a mechanistic recipe. First, create features that capture different aspects of the data, i.e. features that are not too correlated. Second, for skewed, positive-only variables, transform the variable using the natural log. Third, for other variables, limit the influence of outliers. Fourth, standardize the variables.

Assuming the data frame has been imported as “df”, the following code prepares and selects the columns for analysis. The winstb and sdz helper functions simplify the data preparation: winstb winsorizes a variable at the given percentiles to limit the influence of outliers, and sdz standardizes a variable to have mean zero and standard deviation one.

winstb <- function(x,p=0.02){
   # winsorize: cap values below the p-th and above the (1-p)-th percentile
   lim <- quantile(x,c(p,1-p),na.rm=T)
   x[x<lim[1]] <- lim[1]
   x[x>lim[2]] <- lim[2]
   x}
sdz <- function(x){
   # standardize: subtract the mean and divide by the standard deviation
   (x-mean(x,na.rm=T))/sd(x,na.rm=T)}

df <- df%>%
   mutate(
      lnSales = log(Sales),                      # log of gross sales (skewed, positive-only)
      DiscountRate = (Sales-NetSales)/Sales,     # discounts as a share of gross sales
      GrossProf = (Sales-COGS)/Sales,            # gross profit margin
      ShipSize = ShipCost/Sales,                 # shipping cost relative to sales
      ReturnSales = UnitsReturned/Sales,         # units returned relative to sales
      AccountSales = AccountBalance/Sales) %>%   # account balance relative to sales
   mutate(across(c(lnSales,DiscountRate,GrossProf,ShipSize,ReturnSales,AccountSales),winstb)) %>%   # winsorize
   mutate(across(c(lnSales,DiscountRate,GrossProf,ShipSize,ReturnSales,AccountSales),sdz))          # standardize

tmp <- df %>%
   select(lnSales,DiscountRate,GrossProf,ShipSize,ReturnSales,AccountSales)

tmp.h2o <- as.h2o(tmp)
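Before clustering, it can be worth confirming that the preparation worked as intended; each prepared feature should have a mean near zero and a standard deviation near one. A quick check:

tmp %>%
   summarise(across(everything(),
                    list(mean = ~ mean(.x, na.rm = TRUE),
                         sd   = ~ sd(.x, na.rm = TRUE))))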

Clustering

Run the k-means clustering algorithm. Here k = 10 sets the maximum number of clusters to consider. If estimate_k = TRUE, h2o will estimate the best number of clusters to use, up to that maximum. The code also sets standardize = FALSE because the features were already standardized above.

km <- h2o.kmeans(k = 10,
   estimate_k = TRUE,
   standardize = FALSE,
   seed = 1234,
   training_frame = tmp.h2o)
km

The output describes how many clusters h2o chose and the fit of the clustering.

Each group is defined by the means of its features. These cluster centers are shown with the h2o.centers function.

h2o.centers(km)

To attach the estimated cluster for each observation to the feature data frame:

predicted <- h2o.predict(km, tmp.h2o)
tmp$cluster <- as.data.frame(predicted)$predict
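With the cluster labels attached, simple summaries help interpret the groups. For example, the following sketch counts observations per cluster and shows the average of each (standardized) feature by cluster, which is another view of the information in h2o.centers(km).

# Number of observations in each cluster
count(tmp, cluster)

# Mean of each standardized feature within each cluster
tmp %>%
   group_by(cluster) %>%
   summarise(across(everything(), mean))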

Anomaly detection

There are different anomaly detection algorithms, and any model can be used to find observations that do not fit the model. This code uses the extended isolation forest algorithm. Additional information is provided in the references at the end of the chapter.

Run the algorithm.

eif <- h2o.extendedIsolationForest(
   training_frame = tmp.h2o,
   model_id = "eif.hex")

Use the model to create an anomaly score for observations.

anomscore <- h2o.predict(eif, tmp.h2o)

tmp$AnomScore <- as.data.frame(anomscore$anomaly_score)$anomaly_score

The anomaly score is a measure of how different the observation is from the rest of the data with higher scores meaning more unusual.
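To follow up on the most unusual observations, the scores can be sorted or a cutoff can be applied. A simple sketch, using an arbitrary top-1% cutoff for illustration:

# Observations with the highest anomaly scores
tmp %>%
   arrange(desc(AnomScore)) %>%
   head(10)

# Flag the top 1% of scores for review (the 1% threshold is illustrative)
cutoff <- quantile(tmp$AnomScore, 0.99)
tmp$Flagged <- tmp$AnomScore > cutoff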

Tutorial video

Conclusion

In summary, this chapter introduced the foundational concepts and practical applications of unsupervised machine learning, focusing on clustering and anomaly detection techniques. By exploring how clustering groups similar observations and how anomaly detection identifies outliers, the chapter highlighted the importance of distance metrics, feature selection, and data scaling in producing meaningful results. Practical considerations, such as choosing the number of clusters and preparing features, were discussed in the context of accounting data, demonstrating the value of these methods for uncovering hidden patterns and detecting unusual transactions. Through hands-on examples using R and the h2o package, readers gained insight into implementing these techniques for exploratory data analysis.

Review

Mini-case video

References

https://towardsdatascience.com/k-means-clustering-concepts-and-implementation-in-r-for-data-science-32cae6a3ceba

https://uc-r.github.io/kmeans_clustering

https://www.youtube.com/watch?v=4b5d3muPQmA

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/eif.html

License

Data Analytics with Accounting Data and R Copyright © by Jeremiah Green. All Rights Reserved.