XIMNET — Digital Agency — An illustration of MnM Home Whimsical Houses located in Kedah, Malaysia

Understanding classification and clustering in machine learning

A simple example of how this approach can help in building dynamic page content for a better user experience


Artificial Intelligence (AI) and Machine Learning courses often begin by introducing new learners to two basic approaches in data science: supervised and unsupervised learning.

Supervised learning is defined by its use of labeled data. The labels “supervise” the algorithm as it learns to classify data into meaningful categories. Unsupervised learning, on the other hand, uses machine learning algorithms to cluster unlabeled data sets and surface insights. Notice the two verbs used here, classify and cluster. In everyday language both simply mean grouping, but in the machine learning context they differ significantly.

Supervised Learning: Classification

In supervised learning, a classification model learns from the provided labeled data and then predicts the class of new data. It acts as a predictive model that uses the labeled examples to approximate a mapping function (f) from input variables (x) to discrete output variables (y). It is also common for a classifier to predict a continuous value, namely the probability that an object/item belongs to a specific class.

Take spam mail classification as an example. Suppose that, based on statistics, a mail containing the word “promotion” is assigned a 0.3 likelihood of being “spam” and 0.7 of being “not spam”. The mail would normally be given the label with the higher likelihood unless other parameters are specified. As long as the result is a discrete class label for each item, the task can be considered a classification problem.
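As a minimal sketch of this step, the snippet below turns the two likelihoods from the example into one discrete label. The function name and the optional `spam_threshold` parameter are hypothetical, added only to illustrate what "other parameters" could look like.

```python
def predict_label(class_probabilities, spam_threshold=None):
    """Turn per-class probabilities into one discrete label.

    spam_threshold is a hypothetical extra parameter: when given, a mail is
    labeled "spam" as soon as its spam probability reaches the threshold.
    """
    if spam_threshold is not None:
        if class_probabilities["spam"] >= spam_threshold:
            return "spam"
        return "not spam"
    # Default rule: pick the class with the highest probability.
    return max(class_probabilities, key=class_probabilities.get)


# Likelihoods taken from the "promotion" example in the text.
promotion_scores = {"spam": 0.3, "not spam": 0.7}

default_label = predict_label(promotion_scores)
strict_label = predict_label(promotion_scores, spam_threshold=0.25)
```

With the default rule the mail is labeled "not spam". With the assumed threshold of 0.25 the same probabilities produce "spam".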

An example of the process flow for determining the suitable algorithm

Common classification algorithms include Logistic Regression, Naive Bayes and K-Nearest Neighbours, among others. Each algorithm has its pros and cons, along with its own assumptions about the input data, so the type of data being used and the computing resources available need careful consideration. Some algorithms work well when the input variables are independent of one another, some work well with large training data sets, and some output predictions faster than others, which makes them well suited to real-time prediction applications.

To find a clear winner among these choices, one often tries several suitable algorithms and compares their accuracy, computed as the ratio of correctly predicted observations to total observations. You may find out more in this article for a detailed explanation of each algorithm and its recommended usage.
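The accuracy ratio described above can be sketched in a few lines. The labels and predictions here are made-up values for illustration only.

```python
def accuracy(true_labels, predicted_labels):
    """Ratio of correctly predicted observations to total observations."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical ground truth and model output for four mails.
true_labels = ["spam", "not spam", "spam", "not spam"]
predicted_labels = ["spam", "not spam", "not spam", "not spam"]

score = accuracy(true_labels, predicted_labels)  # 3 correct out of 4
```

Running the same function over the predictions of each candidate algorithm gives a simple basis for comparison, although accuracy alone can mislead when the classes are very imbalanced.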

Unsupervised Learning: Clustering

Clustering tasks, lacking labeled data, involve automatically detecting natural groupings in the input data. This is often applied when the data has many variable columns, in which case labeling rows into specific classes before training is not recommended, since doing so may overlook the importance of some features to the overall insights.

Hence it is advisable to use clustering when we expect to learn more about the data sets, for example when drawing insights from web marketing leads.

K-means clustering data visualization by Chris Diana.

Some of the common algorithms used to perform clustering include KMeans, Mean-Shift and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). KMeans is probably the most popular choice, especially among beginners, thanks to its straightforward concept. It starts by choosing the number of clusters to use and randomly placing that many center points among the unstructured input data.

Each data point is then assigned to the cluster whose center point is nearest. Next, each cluster center is recomputed as the mean of the points assigned to it, and the two steps are repeated until the centers no longer move much between iterations.
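The assign-and-update loop can be sketched in plain Python as follows. This is a teaching sketch under simple assumptions (Euclidean distance, initial centers sampled from the data, made-up 2D points), not a replacement for a library implementation.

```python
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial center points
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = []
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, center))
                for center in centers
            ]
            nearest = distances.index(min(distances))
            labels.append(nearest)
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for index, members in enumerate(clusters):
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[index])  # keep an empty cluster's center
        # Stop once no center moves more than tol.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers, labels


# Two well-separated, made-up groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. On real data the result can depend on the random starting centers, which is why library implementations usually run several initializations.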

Despite this simplicity, there are certain disadvantages of clustering algorithms we should be aware of. For instance, determining the number of clusters to start KMeans with is never trivial, even with methods such as the Elbow Method and Silhouette Analysis available. The strategies used by clustering algorithms are sometimes not directly related to the meaning of the data features and variables, but only to the position of each data point relative to the whole group. Some algorithms also do not cope well with outliers, which creates a dilemma when choosing them, since anomalies in unstructured data are often hard to isolate and exclude.

Even so, clustering is a recommended starting point for unstructured data: it helps you get to know the nature and correlations of the data and gain some initial insights. From there, you can work out which approach and algorithm to go with.

A mixture of both

A mixture of both classification and clustering algorithms is a common strategy researchers take when they are dealing with complex and unstructured data.

However, a thorough study should be conducted beforehand to ensure the combination does produce a meaningful result as a whole that meets the objective of the study.

| time_visited | page_path |
| 2021-05-11, 8.00 A.M. | / (Home) |
| 2021-05-11, 8.16 A.M. | /our-new-brand-story |
| 2021-05-11, 8.23 A.M. | /about-ximnet-malaysia |
| 2021-05-11, 8.40 A.M. | /services/xtopiaio-dev-platform |
| 2021-05-11, 8.50 A.M. | /projects-case-studies |
| 2021-05-11, 9.05 A.M. | /get-in-touch |

Take an applied machine learning study as an example, in which we are interested in exploring the user journey of visitors across a collection of web pages. When clustering users into meaningful categories, we start with zero knowledge of which user parameters will be the key features in determining the clusters.

Hence it is better to include more relevant fields in the study so that each aspect is covered. For example, suppose we want to know more about users' browsing behavior. With the help of Google Analytics, we can include data columns such as date visited, page path visited, a user-generated GUID (to differentiate distinct users), session duration and so on.

The table above shows only a couple of the many columns in the data, where the page path represents a page that a specific user visited in one session. On its own, this column does not bring meaningful insight into the data as a whole. User A visiting the home page is not enough evidence of whether he/she is a returning user or not.

{home}|our-new-brand-story|about-ximnet-malaysia|services/xtopiaio-dev-platform|projects-case-studies|get-in-touch

** "|" acting as the separator

To counter that, we group the rows of each session (less than 30 minutes of idle time) into a single row of data and construct a new column titled page path sequence. To elaborate, User A may have more than one row in the collection, and each row may describe different browsing behavior depending on the page path. Instead of looking at one page at a time, why not explore which pages they visited across the entire session?
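A minimal sketch of this grouping step is shown below, using only the Python standard library. The function name, the tuple layout and the extra 10.30 A.M. visit are assumptions made for illustration. The other rows mirror the table above.

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(minutes=30)  # idle time that ends a session


def build_page_path_sequences(visits, idle_limit=IDLE_LIMIT):
    """Group (user_id, visited_at, page_path) rows into one row per session."""
    sessions = []   # list of (user_id, [page labels])
    last_seen = {}  # user_id -> (last visit time, index into sessions)
    for user_id, visited_at, page_path in sorted(visits, key=lambda v: (v[0], v[1])):
        label = page_path.strip("/") or "{home}"
        previous = last_seen.get(user_id)
        if previous is not None and visited_at - previous[0] < idle_limit:
            index = previous[1]  # still inside the same session
            sessions[index][1].append(label)
        else:
            sessions.append((user_id, [label]))  # start a new session
            index = len(sessions) - 1
        last_seen[user_id] = (visited_at, index)
    return [(user_id, "|".join(pages)) for user_id, pages in sessions]


day = datetime(2021, 5, 11)
visits = [
    ("A", day.replace(hour=8, minute=0), "/"),
    ("A", day.replace(hour=8, minute=16), "/our-new-brand-story"),
    ("A", day.replace(hour=8, minute=23), "/about-ximnet-malaysia"),
    ("A", day.replace(hour=8, minute=40), "/services/xtopiaio-dev-platform"),
    ("A", day.replace(hour=8, minute=50), "/projects-case-studies"),
    ("A", day.replace(hour=9, minute=5), "/get-in-touch"),
    # Hypothetical later visit, more than 30 minutes after the previous one.
    ("A", day.replace(hour=10, minute=30), "/"),
]

sessions = build_page_path_sequences(visits)
```

The six visits from the table fall into one session because no gap reaches 30 minutes, while the assumed 10.30 A.M. visit starts a second session for the same user.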

To build a generalized user behavior classification engine, we need more than this. Data from more pages and websites should be included. Different websites use all kinds of terms and words in their page paths, so we would end up with thousands of rows of data whose page path sequence values are highly dissimilar. For example, the contact us page for Website A might have a relative page path like “/contact-us”, whereas Website B might have something like “/enquire-more”.

This causes trouble because it adds dimensionality to the overall data, which leads to the curse of dimensionality. The clustering model would have difficulty finding any similarity between data points and would put each row into a different cluster. In the end, the number of clusters would grow with the number of data rows, which is not feasible.

A solution to this potential problem (inspired by this article) is to first build another classification engine for the page path itself. This is possible by using the KeywordProcessor class available in the flashtext Python package. The main logic is to associate certain keywords with a particular category, match those keywords against the page path and pick the class with the maximum Matching_value.

Matching_value = (Number of keywords matched with one page type)/(Total number of keywords matched)

An example of the list of keywords for each page class:

keywords_AboutUs = ["about-us", "about", "organisation", "staff", "team", "who", "join", "careers"]
keywords_ContactUs = ["contact", "reach-us", "contact-us", "email", "phone", "know", "more"]
keywords_Event_Happening_News = ["whats on", "news", "berita", "cerita", "updates", "terkini", "events", "press-release", "apply", "media", "january", "february", "march", "april", "may", "june", "july", "august", "september", "october", "november", "december", "newsletter"]
keywords_Miscellaneous = ["terms", "covid", "faq", "feedback", "help", "legal", "privacy", "whistleblowing", "safety", "sustainability", "policy", "subscribe", "improve", "social", "unsubscribe"]

** For Homepage, the assumption is that a page path whose relative path is "/" is classified as HomePage
** Page paths that match no keyword in these lists are classified as Business/Product page

By assigning these page classes to the page path sequence, the data becomes much more structured, whether or not the collection includes pages from other websites.

Homepage|AboutUs|AboutUs|Business/Product|Business/Product|ContactUs

** "|" acting as the separator
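The Matching_value rule can be sketched as follows. To keep the example dependency-free, a simple regular expression match stands in for flashtext's KeywordProcessor, so match counts may differ from what flashtext would report. The function names are hypothetical, and the keyword lists are the sample lists from this article, so the class assigned to any given path depends entirely on them.

```python
import re

PAGE_KEYWORDS = {
    "AboutUs": ["about-us", "about", "organisation", "staff", "team",
                "who", "join", "careers"],
    "ContactUs": ["contact", "reach-us", "contact-us", "email", "phone",
                  "know", "more"],
    "Event_Happening_News": ["whats on", "news", "berita", "cerita", "updates",
                             "terkini", "events", "press-release", "apply",
                             "media", "january", "february", "march", "april",
                             "may", "june", "july", "august", "september",
                             "october", "november", "december", "newsletter"],
    "Miscellaneous": ["terms", "covid", "faq", "feedback", "help", "legal",
                      "privacy", "whistleblowing", "safety", "sustainability",
                      "policy", "subscribe", "improve", "social", "unsubscribe"],
}


def count_matches(page_path, keywords):
    """Count keywords that appear in the path as whole words."""
    text = page_path.lower()
    return sum(
        1
        for keyword in keywords
        if re.search(r"(?<![a-z0-9])" + re.escape(keyword) + r"(?![a-z0-9])", text)
    )


def classify_page_path(page_path):
    """Return the page class with the highest Matching_value."""
    if page_path == "/":
        return "Homepage"
    counts = {
        page_class: count_matches(page_path, keywords)
        for page_class, keywords in PAGE_KEYWORDS.items()
    }
    total = sum(counts.values())
    if total == 0:
        return "Business/Product"  # no keyword matched at all
    matching_values = {c: n / total for c, n in counts.items()}
    return max(matching_values, key=matching_values.get)


paths = ["/", "/about-ximnet-malaysia", "/services/xtopiaio-dev-platform", "/enquire-more"]
class_sequence = "|".join(classify_page_path(path) for path in paths)
```

Because every class shares the same denominator, picking the maximum Matching_value is equivalent to picking the class with the most matched keywords. Ties are resolved here in favour of the class listed first, which is an arbitrary choice of this sketch.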

With this extra column, combined with other columns (session_duration, device_category, geolocation) serving as the key features, we can then use clustering algorithms like KMeans to group users without labels. Since we are working with mixed-type data, where both numerical and categorical inputs exist, it is recommended to also try approaches designed for such data and compare the performance, such as KMedoid with Gower Distance.
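To illustrate why Gower Distance suits mixed-type data, here is a minimal sketch of its usual definition: numeric features contribute a range-normalised absolute difference, categorical features contribute 0 when equal and 1 otherwise, and the result is the average over all features. The feature values and the 600-second range are made up for the example.

```python
def gower_distance(row_a, row_b, numeric_ranges):
    """Average per-feature distance between two rows of mixed-type data.

    numeric_ranges maps each numeric feature to its (max - min) over the
    data set; every other feature is treated as categorical.
    """
    total = 0.0
    for feature, value_a in row_a.items():
        value_b = row_b[feature]
        if feature in numeric_ranges:
            spread = numeric_ranges[feature]
            total += abs(value_a - value_b) / spread if spread else 0.0
        else:
            total += 0.0 if value_a == value_b else 1.0
    return total / len(row_a)


# Hypothetical sessions using the column names from the text.
session_a = {
    "session_duration": 120,
    "device_category": "mobile",
    "page_path_sequence": "Homepage|AboutUs|ContactUs",
}
session_b = {
    "session_duration": 300,
    "device_category": "desktop",
    "page_path_sequence": "Homepage|AboutUs|ContactUs",
}
numeric_ranges = {"session_duration": 600}  # assumed max - min in seconds

distance = gower_distance(session_a, session_b, numeric_ranges)
```

A pairwise distance matrix built this way can be passed to a clustering algorithm that accepts precomputed distances, such as a K-Medoids implementation.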


It is always good practice to spend time trying multiple approaches to clean and pre-process the data before feeding it into the final user clustering task. This ensures the clusters are based on valid features that are meaningful to the study. The example above is just a simple illustration of how mixing supervised (classification) and unsupervised (clustering) algorithms works.

Having user behavior clustered then serves as a strong foundation for the goal of building dynamic content webpages. Web content can then be displayed dynamically based on user behavior, whether the visitors are new users, returning users, or even potential investors in the company.

XIMNET is a digital solutions provider with a two-decade track record, specialising in web application development, AI Chatbot and system integration.

XIMNET is launching a brand new way of building AI Chatbot with XYAN. Get in touch with us to find out more.




Digital Agency in Malaysia | www.ximnet.com.my
