Clustering in Machine Learning

Lauren Mai
11 min read · Oct 24, 2020

A. FOUNDATION KNOWLEDGE

1. Introduction to Clustering

I think the best explanation of K-Means is HERE, or you can read on:

1.1 What is clustering?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)¹. (Source: Wikipedia)

To sum up, the basic characteristics of clustering are:

  • It is an exploratory data analysis technique.
  • It is an unsupervised learning method.

1.2 Why is clustering important?

  • Meaningfulness: clusters expand domain knowledge¹⁴. It helps us to make sense of and extract value from large sets of structured and unstructured data⁸. For example, in the medical field, researchers applied clustering to gene expression experiments. The clustering results identified groups of patients who respond differently to medical treatments.¹⁴
  • Usefulness: clusters, on the other hand, serve as an intermediate step in a data pipeline¹⁴. They help us take a sweeping glance at the data en masse, and then form some logical structures based on what we find before going deeper into the nuts-and-bolts analysis.⁸

If you talk about the nuts and bolts of a subject or an activity, you are referring to the detailed practical aspects of it rather than abstract ideas about it.

Applications of Clustering:⁹

  • Marketing: It can be used to characterize and discover customer segments for marketing purposes.
  • Biology: It can be used for classification among different species of plants and animals.
  • Libraries: It is used for clustering different books on the basis of topics and information.
  • Insurance: It is used to analyze customers and their policies, and to identify fraud.
  • City Planning: It is used to group houses and study their values based on geographical location and other factors.
  • Earthquake studies: By studying earthquake-affected areas, we can determine the dangerous zones.

1.3 Classification Vs. Clustering⁸

Classification:

  • Before you start, you already know the number of classes into which your data should be grouped and you already know what class you want each data point to be assigned.
  • The data in the dataset being learned from is labeled.

Clustering:

  • You have no predefined notion of how many clusters are appropriate for your data, and you rely on the clustering algorithm to sort and cluster the data in the most appropriate way.
  • With clustering techniques, you’re learning from unlabeled data.

1.4 What are the clustering methods?

1.4.1 Density-Based Methods¹⁰

  • Determines cluster assignments based on the density of data points in a region. Clusters are assigned where there are high densities of data points separated by low-density regions.¹⁴
  • This approach doesn’t require the user to specify the number of clusters. Instead, there is a distance-based parameter that acts as a tunable threshold. This threshold determines how close points must be to be considered a cluster member.¹⁴
  • Points that are not part of a cluster are labeled as noise.¹⁰

Examples:

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • OPTICS (Ordering Points To Identify the Clustering Structure)
  • DENCLUE (DENsity-based CLUstEring)
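As a concrete illustration, here is a minimal DBSCAN sketch with scikit-learn. The two blobs, the lone noise point, and the `eps`/`min_samples` values are toy choices of mine, not from any of the cited sources:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away point (toy data for illustration)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    [[20.0, 20.0]],  # isolated point -> should be labeled as noise
])

# eps is the tunable distance threshold mentioned above;
# min_samples controls how many neighbors make a point "dense"
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
```

Note that we never told DBSCAN how many clusters to find; it discovers two and marks the isolated point with the noise label `-1`.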

Strengths vs Weaknesses

Strengths

  • They excel at identifying clusters of nonspherical shapes.
  • They’re resistant to outliers.

Weaknesses

  • They aren’t well suited for clustering in high-dimensional spaces.¹⁴
  • They have trouble identifying clusters of varying densities.¹⁴

1.4.2 Hierarchical-Based Methods

  • Involves creating clusters that have a predetermined ordering from top to bottom.¹³
  • Similar to partitional clustering, in hierarchical clustering the number of clusters (k) is often predetermined by the user.
  • Unlike many partitional clustering techniques, hierarchical clustering is a deterministic process, meaning cluster assignments won’t change when you run an algorithm twice on the same input data.¹⁴

Agglomerative method is the bottom-up approach. It merges the two points that are the most similar until all points have been merged into a single cluster.¹⁴

Divisive method is the top-down approach. It starts with all points as one cluster and splits the least similar clusters at each step until only single data points remain.¹⁴

There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.
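A minimal agglomerative (bottom-up) sketch with scikit-learn, on six toy points of my own, also demonstrates the determinism claim above, since two runs on the same input give identical labels:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three obvious pairs of points (toy data for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.2, 4.9],
              [10.0, 0.0], [9.8, 0.2]])

# Ward linkage merges the most similar groups bottom-up
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels_a = model.fit_predict(X)
labels_b = model.fit_predict(X)  # deterministic: identical on the same input
```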

Strengths vs Weaknesses¹⁴

Strengths:

  • They often reveal the finer details about the relationships between data objects.
  • They provide an interpretable dendrogram.

Weaknesses:

  • They’re computationally expensive with respect to algorithm complexity.
  • They’re sensitive to noise and outliers.

1.4.3 Partitioning Methods

  • The simplest and most fundamental version of cluster analysis is partitioning: this method classifies the information into multiple groups based on the characteristics and similarity of the data.
  • These techniques require the user to specify the number of clusters, indicated by the variable k.
  • Data objects are divided into nonoverlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.¹⁴
  • These algorithms are typically nondeterministic, meaning they can produce different results from two separate runs even on the same input.¹⁴
  • Many algorithms fall under the partitioning method. Some popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
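A minimal K-Means sketch with scikit-learn on three toy blobs of my own (the data and parameter values are illustrative). Note that, unlike the density-based example, we must pass the number of clusters up front, and `n_init` restarts are used to soften the nondeterminism:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated toy blobs (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(30, 2))
               for c in ([0, 0], [4, 4], [8, 0])])

# The user must specify k; n_init reruns the algorithm from
# several random initializations and keeps the best result
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
```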

Strengths vs. Weaknesses¹⁴

Strengths

  • They work well when clusters have a spherical shape.
  • They’re scalable with respect to algorithm complexity.

Weaknesses

  • They’re not well suited for clusters with complex shapes and different sizes.
  • They break down when used with clusters of different densities.

1.4.4 Grid-Based Methods

  • These methods quantize the object space into a finite number of cells that form a grid structure, and then perform the clustering operations on the grid. STING and CLIQUE are well-known examples.

Summary of Clustering Methods (Source: Science Direct)

2. Introduction to K-Means

What is K-Means Algorithm?

  • K-Means is considered one of the most used clustering algorithms due to its simplicity.²
  • K-Means is used when we have unlabeled data (i.e., data without defined categories or groups) but still have features. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.

With supervised learning, you have features and labels. The features are the descriptive attributes, and the label is what you’re attempting to predict or forecast.⁴

  • The approach K-Means follows to solve the problem is called Expectation-Maximization. The E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster.
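The E-step and M-step described above can be sketched from scratch in a few lines of numpy. This is a simplified illustration (fixed iteration count, no convergence check, toy data of my own), not a production implementation:

```python
import numpy as np

def kmeans_em(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two tiny, well-separated toy blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in ([0.0, 0.0], [3.0, 3.0])])
labels, centroids = kmeans_em(X, k=2)
```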

What are applications of K-Means?

Source: Oracle Blog. Introduction to K-means Clustering. (Link)

When is K-Means applied?

K-Means can typically be applied to data that:

  • has a smaller number of dimensions;
  • is numeric;
  • is continuous.

B. LET’S MAKE IT HAPPEN

Here is the important part: it is fairly easy to do research on this topic, but we always struggle when it comes to the execution.

And I know there are countless ways to run clustering, and it is impossible to cover all the cases. That's why I hope this article attracts feedback from readers who have hands-on experience.

Now, I think the very first question that needs to be asked is:

How many steps does it take to run clustering (aka Clustering Workflow)?¹⁴

The best answer comes from this Google course. I will summarize the info below, but you can always find it here:

Step 1: Prepare Data.

  • Normalizing Data
  • Using the Log Transform
  • Using Quantiles

You must ensure that the prepared data lets you accurately calculate the similarity between examples by normalizing, scaling, and transforming the feature data.
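The three preparation techniques listed above can be sketched in numpy on a single made-up, heavily skewed feature (the values are illustrative):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # skewed toy feature

# Normalizing: z-score scaling so features share a comparable scale
z = (x - x.mean()) / x.std()

# Log transform: compresses a long right tail
logged = np.log1p(x)

# Quantiles: rank-based bucketing that is robust to outliers
quartile_edges = np.quantile(x, [0.25, 0.5, 0.75])
quartile = np.searchsorted(quartile_edges, x)
```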

Step 2: Create similarity metric.

Before a clustering algorithm can group data, it needs to know how similar pairs of examples are. You quantify the similarity between examples by creating a similarity metric. Creating a similarity metric requires you to carefully understand your data and how to derive similarity from your features.
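As a tiny example of how the choice of similarity metric matters, compare Euclidean distance with cosine similarity on two made-up vectors that point in the same direction but differ in magnitude:

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])  # same direction as a, twice the magnitude

# Euclidean distance: sensitive to magnitude
euclidean = np.linalg.norm(a - b)

# Cosine similarity: compares direction only and ignores magnitude
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Here the cosine similarity is a perfect 1.0 while the Euclidean distance is nonzero, so the two metrics would group these examples differently.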

Step 3: Run Clustering Algorithm.

A clustering algorithm uses the similarity metric to cluster data. This guide focuses on k-means.

Step 4: Interpret results and adjust your clustering.

Checking the quality of your clustering output is iterative and exploratory because clustering lacks “truth” that can verify the output. You verify the result against expectations at the cluster-level and the example-level. Improving the result requires iteratively experimenting with the previous steps to see how they affect the clustering.
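Since clustering lacks ground truth, one common proxy for quality is the silhouette score. A sketch of that iterative check, on two toy blobs of my own, comparing a few candidate values of k:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two clear toy blobs; we pretend we don't know the "right" k
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(40, 2))
               for c in ([0, 0], [3, 3])])

# Higher silhouette score = tighter, better-separated clusters
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
```

On this data, k = 2 should score highest, matching the structure we built in.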

From my experience, I cannot emphasize enough the importance of steps 1 and 4.

To sum up the painful lesson, I would say: "When doing machine learning, 90% of the time goes into data preparation." Where does that 90% come from? Well, I made that number up. Nonetheless, my main point is this: if you think the main part of running clustering is the algorithm (step 3), you had better rethink that.

Even though I noticed this point at the beginning, my data prep was still not good enough. The K-Means result turned out to be messy, and I spent a very long time trying to portray the group characteristics, but in the end I still wasn't happy with the result.

So if you are reading this article, I hope you will always keep in mind that "the algorithm is not everything."

Besides that, for anyone who has never run K-Means before, please note that the K-Means output is just a set of group labels. For example, if you cluster your data into 5 groups, the returned result only tells you which object belongs to which group. If you want to know the differences or similarities between those groups, you have to explore them yourself.

That's why steps 1 and 4 are closely related to each other, and why you should never underestimate step 4 either.

Before we dive into the detailed guidelines for each step, I would like to illustrate the execution with an example (it was also my very first clustering project).

We are a pharmacy franchise (retail stores), and we want to understand customer behavior toward a specific product category: condoms (interesting, huh?). So I decided to train on consumer purchase data from the last 3 years.

Okay, let’s go!

Step 1: Prepare Data


If you are new to this topic, please take the time to review the article “Introduction to Transforming Data” from Google; it’s awesome:

1.1 Why Data Transformation?

Mandatory transformations for data compatibility.

  • Converting non-numeric features into numeric (Because you can’t do matrix multiplication on a string).
  • Resizing inputs to a fixed size.

Optional quality transformations that may help the model perform better

  • Normalized numeric features (most models perform better afterwards).
  • Allowing linear models to introduce non-linearities into the feature space.

Strictly speaking, quality transformations are not necessary — your model could still run without them. But using these techniques may enable the model to give better results.

1.2 Where to Transform?

You can apply transformations either while generating the data on disk, or within the model.

But be careful: each option has its own pros and cons. Make sure to read the article above before choosing one.

1.3 How to transform?

➤ There are 2 main types of data that need transformation: numeric and categorical.

For transformation of numeric data, you can use normalization or bucketing method:

  • Normalizing — transforming numeric data to the same scale as other numeric data.
  • Bucketing — transforming numeric (usually continuous) data to categorical data.
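Both transformations can be sketched in numpy on a single hypothetical "age" feature (the values and bucket edges are my own, purely for illustration):

```python
import numpy as np

ages = np.array([5.0, 17.0, 25.0, 42.0, 70.0])  # toy continuous feature

# Normalizing: min-max scaling onto the [0, 1] range
scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Bucketing: continuous ages -> three categorical bins
edges = [18, 65]  # illustrative bucket edges
buckets = np.digitize(ages, edges)  # 0 = under 18, 1 = 18-64, 2 = 65+
```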

For the transformation of categorical data, please read the article below:

➤ After that, you should consider removing outliers. There is still some debate about whether outliers should be removed before or after running K-Means. In my opinion, outlier detection should be included in the first stage. Please share your thoughts if you think otherwise; it would be very helpful.
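As a sketch of that first-stage option, here is a simple IQR-based outlier filter in numpy (the data and the common 1.5 × IQR rule-of-thumb fences are illustrative choices, not a prescription):

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])  # 500 is an obvious outlier

# Interquartile range of the feature
q1, q3 = np.quantile(x, [0.25, 0.75])
iqr = q3 - q1

# Keep only the points inside the 1.5 * IQR fences
mask = (x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)
cleaned = x[mask]
```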

1.4 What work is involved in this step?

Here is the truth: data rarely happens to be available whenever you need it.

So before performing any transformations, make sure you have the data first (it might sound obvious, but people often forget this part when estimating workload).

1.5 In my real-life project, what happened?

  • My very first task was to come up with basic features (variables) of condom products. In some companies this data might already be available, but that was not my case. So I listed all the features and presented the data in Yes/No or 1/0 form. (Example below)
  • Consumer purchase data was another thing I needed to pay attention to. I needed to examine whether my data was biased and whether it qualified for the next step. As the Google course suggests, we should: (1) examine several rows of data, (2) check basic statistics, (3) fix missing numerical entries…
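A minimal sketch of the Yes/No-to-1/0 step in plain Python. The feature names here are hypothetical placeholders of mine, not the author's real product attributes:

```python
# Hypothetical product records; the feature names are illustrative only
products = [
    {"sku": "A", "ultra_thin": "Yes", "lubricated": "No"},
    {"sku": "B", "ultra_thin": "No", "lubricated": "Yes"},
]

def encode(record):
    # Map Yes/No answers to 1/0; leave other values (e.g. the SKU) untouched
    return {key: {"Yes": 1, "No": 0}.get(value, value)
            for key, value in record.items()}

encoded = [encode(p) for p in products]
```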

After finishing this step, the data is ready to be put on the table for transformation.

References

[01] Wikipedia | Cluster analysis (Link)

[02] Towards Data Science | K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks (Link)

[03] Oracle Blog | Introduction to K-means Clustering (Link)

[04] Python Programming | Regression — Features and Labels (Link)

[05] Stanford | CS221 (Link)

[06] Codecademy | Normalization (Link)

[07] Developers (Google) | Clustering in Machine Learning (Link)

[08] Dummies | The Importance of Clustering and Classification in Data Science (Link)

[09] GeeksForGeeks | Clustering in Machine Learning (Link)

[10] ArcGIS Pro | How Density-based Clustering works (Link)

[11] Springer Link | Density-Based Clustering (Link)

[12] Datanovia | DBSCAN: Density-Based Clustering Essentials (Link)

[13] Saedsayad | Hierarchical Clustering (Link)

[14] Real Python | K-Means Clustering in Python: A Practical Guide (Link)

[15] Developers (Google) | Prepare Data (Link)

