K-means Clustering & Its Real Use-Case in the Security Domain
What is Clustering?
Clustering is the task of identifying groups of similar data points in a dataset. It is one of the most popular techniques in data science. Entities within a group are more similar to one another than to entities in other groups. In this article, I will take you through the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.
How Does K-Means Clustering Work?
The flowchart below shows how k-means clustering works:
The goal of the K-Means algorithm is to find K clusters in the given input data, where each point belongs to the cluster whose center (centroid) is nearest to it. There are a couple of ways to accomplish this. One is trial and error: we specify a value of K (e.g., 3, 4, or 5) and, as we progress, keep changing that value until we get the best clusters.
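To make the trial-and-error idea concrete, here is a minimal sketch of k-means in plain Python. It is an illustration under stated assumptions, not a production implementation: the sample points are made up, the farthest-first initialisation is my own choice to keep the run deterministic (the algorithm is more commonly started from random points), and "best" is judged by inertia, the sum of squared distances from each point to its centroid.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic k-means (Lloyd's algorithm). Returns (centroids, labels, inertia)."""
    # Assumption: farthest-first initialisation, chosen only for determinism.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))

    inertia = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia

# Made-up 2-D data with three visually obvious groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.8),
]

# Trial and error over K: run the algorithm for each candidate and compare.
inertia_by_k = {k: kmeans(points, k)[2] for k in (2, 3, 4)}
for k, inertia in inertia_by_k.items():
    print(f"K={k}: inertia={inertia:.3f}")
```

Inertia tends to keep shrinking as K grows, so the lowest value alone is not a good criterion. A common heuristic is to pick the K after which the improvement becomes small (the "elbow").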
Use-Cases in the Security Domain
1. Using Cluster Analysis for Comprehensive Threat Detection
Cyber-attacks are becoming stealthier, with the ability to remain dormant for long periods of time. One way malware stays hidden is by changing its behavior dynamically and autonomously to avoid detection. It can dynamically generate new communication channels known to the malware and the attacker, or sacrifice compromised assets by triggering an alert, tricking victimized companies and their security analysts into feeling safe, unaware that the threat persists on other, more critical devices. These sophisticated attacks can only be identified through the use of more data. Unfortunately, the more data there is, the more difficult the analysis becomes.
Analysts are overwhelmed as they wade through terabytes of logs to find, confirm, and mitigate cyber threats, so they often miss the signs and leave the organization open to attacks.
Cluster analysis, a type of unsupervised machine learning, enables companies to solve this problem. By uncovering hidden patterns and structures in large sets of data, it can identify indicators of compromise that would otherwise remain hidden from analysts.
2. Insurance Fraud Detection
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
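The proximity idea can be sketched in a few lines. Everything specific here is hypothetical: the two features (claim amount and policy age, both assumed to be pre-scaled to the 0-1 range), the centroid values, and the distance threshold are invented for illustration. In practice the centroids would come from running k-means on historical fraudulent claims.

```python
import math

# Hypothetical centroids of clusters found in historical fraudulent claims.
# Assumed features, scaled to 0-1: (claim amount, policy age).
fraud_centroids = [(0.9, 0.1), (0.8, 0.9)]

def nearest_fraud_distance(claim, centroids):
    """Euclidean distance from a claim to the closest fraud centroid."""
    return min(math.dist(claim, c) for c in centroids)

def flag_claims(claims, centroids, threshold):
    """Return the claims lying within `threshold` of any fraud centroid."""
    return [c for c in claims if nearest_fraud_distance(c, centroids) <= threshold]

# Made-up incoming claims using the same two scaled features.
new_claims = [(0.88, 0.12), (0.2, 0.5), (0.75, 0.95)]
suspicious = flag_claims(new_claims, fraud_centroids, threshold=0.15)
print(suspicious)  # [(0.88, 0.12), (0.75, 0.95)]
```

A flagged claim is only a candidate for review, not proof of fraud. The threshold trades missed fraud against false alarms and would need tuning on real data.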
3. Call Detail Record Analysis
A call detail record (CDR) is the information captured by telecom companies during a customer's call, SMS, and internet activity. Combined with customer demographics, this information provides greater insight into the customer's needs. We can cluster customer activity over 24 hours using the unsupervised k-means clustering algorithm. The resulting clusters help us understand customer segments by their hourly usage.
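Before k-means can group customers by hourly usage, each customer has to be represented as a vector. One possible representation, sketched below, is a 24-element profile holding the share of that customer's activity in each hour of the day. The record layout (customer id, hour, activity count) and the sample rows are assumptions made for illustration. Real CDR schemas vary by operator.

```python
from collections import defaultdict

# Hypothetical CDR rows: (customer_id, hour_of_day 0-23, activity_count).
records = [
    ("alice", 9, 4), ("alice", 10, 6), ("alice", 21, 2),
    ("bob", 22, 5), ("bob", 23, 5),
]

def hourly_profiles(rows):
    """Build one 24-dimensional usage profile per customer.

    Each profile is normalised to sum to 1, so customers are compared by
    when they are active rather than by how much they use the network.
    """
    totals = defaultdict(lambda: [0.0] * 24)
    for customer, hour, count in rows:
        totals[customer][hour] += count

    profiles = {}
    for customer, counts in totals.items():
        total = sum(counts)
        profiles[customer] = [c / total if total else 0.0 for c in counts]
    return profiles

profiles = hourly_profiles(records)
print(profiles["alice"][10])  # 0.5: half of alice's activity falls in hour 10
```

The list of profile vectors can then be passed to any k-means implementation. Customers whose activity peaks at similar hours end up in the same cluster.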
These are just a few use cases, and the list goes on. Whether in the security domain or any other, K-means is an effective and easy way to perform clustering in machine learning.