K-Means clustering: a use case in the security domain

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes.
K Means:
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points aggregated together because of certain similarities.
You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
Every data point is allocated to exactly one of the clusters in a way that reduces the within-cluster sum of squares, that is, the total squared distance between the points and the centroid of their cluster.
In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping the clusters as compact as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
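To make the objective concrete, here is a minimal sketch in plain Python of the two quantities described above: a centroid as the mean of its cluster, and the within-cluster sum of squares. The function names and the sample points are made up for illustration and do not come from any library or from the case study.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Return the mean of a non-empty list of points, coordinate by coordinate."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_ss(clusters: List[List[Point]]) -> float:
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(squared_distance(p, c) for p in points)
    return total


# The same four made-up 2-D points, grouped in two different ways.
good = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 9.0), (9.0, 11.0)]]
bad = [[(1.0, 1.0), (9.0, 9.0)], [(1.0, 3.0), (9.0, 11.0)]]

print(centroid(good[0]))        # [1.0, 2.0]
print(within_cluster_ss(good))  # 4.0
print(within_cluster_ss(bad))   # 128.0
```

The grouping that keeps nearby points together has the much smaller sum of squares, which is the quantity K-means tries to drive down.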
How the K-means algorithm works:
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
- The centroids have stabilized — there is no change in their values because the clustering has been successful.
- The defined number of iterations has been achieved.
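The loop described above can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (initial centroids are sampled from the data points, an empty cluster keeps its old centroid, and the names `k_means`, `max_iter` and `seed` are my own), not the code used in the case study.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points: List[Point], k: int, max_iter: int = 100,
            seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    rng = random.Random(seed)
    # Start from k randomly selected data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels: List[int] = []
    for _ in range(max_iter):  # halting condition: iteration limit reached
        # Assignment step: each point goes to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                dims = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dims)])
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # halting condition: centroids stabilized
            break
        centroids = new_centroids
    return centroids, labels


# Made-up points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which group ends up as cluster 0 or cluster 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.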
K-Means in cyber profiling:
The idea of cyber profiling is derived from criminal profiles, which give the investigation division information for classifying the types of criminals who were at the crime scene. More specifically, profiling is based on what is known and not known about the criminal.
A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity.
The difficulty in implementing cyber profiling lies in the diversity of user data, and in the fact that behavior online sometimes differs from actual behavior. Because personal behavior varies from one individual to another, inductive generalizations can be very reliable but can also lead to a misreading of the behavior analysis. The cyber-profiling process therefore combines deductive and inductive methods.
Here I briefly summarize a case study from higher education in Indonesia that used K-Means for cyber profiling.
Research methods:
To build a cyber profile of a higher education institution, the study used a log of Internet activity from one educational institution as its sample data. The log data contain not only the websites accessed by users but also the packets received and sent over the network. The data cover five days of network traffic and amount to 320,773 records.
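Before log records can be clustered, they have to be reduced to one numeric value per item. The post does not state the exact feature the study used, so the sketch below assumes the value is a per-website visit count. The records and the field names (`host`, `bytes_sent`, `bytes_received`) are hypothetical and are not the study's schema or data.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union

Record = Dict[str, Union[str, int]]

# Made-up records. The field names are assumptions for illustration only.
records: List[Record] = [
    {"host": "search.example", "bytes_sent": 300, "bytes_received": 4200},
    {"host": "news.example", "bytes_sent": 250, "bytes_received": 9100},
    {"host": "search.example", "bytes_sent": 310, "bytes_received": 3900},
    {"host": "ads.example", "bytes_sent": 120, "bytes_received": 800},
    {"host": "search.example", "bytes_sent": 280, "bytes_received": 4500},
]


def visits_per_site(rows: Iterable[Record]) -> Counter:
    """Count how many log records refer to each website."""
    return Counter(str(row["host"]) for row in rows)


counts = visits_per_site(records)
# One numeric value per website, ready to be passed to a clustering step.
features = sorted(counts.items())
print(features)
```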

The flowchart above shows how K-Means is applied in the profiling process.

This shows the flow of the K-Means algorithm.
The results of this were as follows:

- Cluster 1: The first cluster has data values in the range 1-10, because the data in this cluster have a low level of traffic. The first cluster is therefore categorized as the websites with the least traffic compared with the other clusters.
- Cluster 2: The second cluster has 126 records as members. Its values are in the range 11-31, which indicates a medium level of visits, because the values are higher than the average value produced by the clustering. The second cluster is therefore categorized as having moderate traffic levels.
- Cluster 3: The third cluster has 33 records as members. It has the fewest members of all the clusters, but its members have the highest values in the data, in the range 34-63, which is far above average. The third cluster is therefore categorized as having the highest traffic level.
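The way three clusters turn into "low", "medium" and "high" traffic levels can be sketched on one-dimensional data. The visit counts below are made up and are not the study's data. To keep the result reproducible, the sketch also starts from the minimum, median and maximum values instead of random centroids, which is my simplification rather than something the study describes.

```python
from statistics import mean, median_low
from typing import Dict, List, Tuple


def k_means_1d(values: List[float], centroids: List[float],
               max_iter: int = 100) -> Tuple[List[float], List[int]]:
    """K-means iterations on 1-D data; returns centroids and a cluster index per value."""
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign each value to the closest centroid.
        labels = [min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
                  for v in values]
        # Move each centroid to the mean of its members.
        updated = []
        for i, c in enumerate(centroids):
            members = [v for v, lab in zip(values, labels) if lab == i]
            updated.append(mean(members) if members else c)
        if updated == centroids:  # centroids have stabilized
            break
        centroids = updated
    return centroids, labels


# Made-up visit counts per website (illustrative only).
visits = [1.0, 3.0, 5.0, 8.0, 12.0, 20.0, 25.0, 30.0, 40.0, 55.0, 62.0]
start = [min(visits), median_low(visits), max(visits)]
centroids, labels = k_means_1d(visits, start)

# Rank clusters by centroid so the names follow the traffic level.
order = sorted(range(3), key=lambda i: centroids[i])
names = {order[0]: "low", order[1]: "medium", order[2]: "high"}
groups: Dict[str, List[float]] = {"low": [], "medium": [], "high": []}
for v, lab in zip(visits, labels):
    groups[names[lab]].append(v)
print(groups)
```

With these sample values the high-traffic group ends up with the fewest members, which mirrors the shape of the result reported above: few websites, but the largest values.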
Analysis:
The first cluster shows low levels of traffic but contains the most websites. Most of the data in the first cluster are advertising media websites whose traffic coincided with a visit to another website. In the second cluster, which has moderate traffic levels, the data indicate that the members are news sites.
The third cluster is the group of websites with the highest traffic levels but the fewest websites. Data in this cluster show that social media websites have relatively high traffic levels. Other data from the third cluster show that Internet users access search engine websites more frequently than other websites, including social media websites.
So this was the analysis done on the clusters produced by the K-Means algorithm.
Hope you all enjoy this blog. 😊😌
Thanks for reading ☺️