Anomaly detection is the technique of identifying rare events or observations that differ significantly from the rest of the data. Such “anomalous” behavior typically points to a problem such as credit card fraud, a failing server, or a cyber attack.
Anomalies fall broadly into three categories:
Point Anomaly: A single data point is a Point Anomaly if it lies far from the rest of the data.
Contextual Anomaly: An observation is a Contextual Anomaly if it is abnormal only in a particular context. For example, a temperature reading that is ordinary in summer may be anomalous in winter.
Collective Anomaly: A set of related data instances is a Collective Anomaly if the set as a whole is abnormal, even when each instance looks normal on its own.
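To make the idea of a point anomaly concrete, here is a minimal sketch that flags values lying far from the rest using a z-score. The z-score rule, the threshold of 3, and the sample readings are assumptions chosen for illustration; they are not part of the attendance example that follows.

```python
import numpy as np

def point_anomalies(values, threshold=3.0):
    """Return the indices of values whose |z-score| exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values are identical, so nothing can be "far off".
        return np.array([], dtype=int)
    z_scores = (values - values.mean()) / std
    return np.flatnonzero(np.abs(z_scores) > threshold)

# Hypothetical sensor readings: twelve values near 10 and one far-off value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1,
            9.7, 10.0, 10.2, 9.9, 10.1, 25.0]
print(point_anomalies(readings))
```

Only the last reading is far enough from the mean to be flagged. A z-score is a deliberately simple choice; it works poorly on very small samples or on data with several clusters, which is one reason to use clustering-based methods instead.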
Anomaly detection can be done using machine learning, in the following ways:
Supervised Anomaly Detection: This method requires a labeled dataset containing both normal and anomalous samples, which is used to build a predictive model that classifies future data points. Commonly used algorithms for this purpose include supervised neural networks, SVM (support vector machine), and KNN (k-nearest neighbors).
Unsupervised Anomaly Detection: This method does not require labeled training data. Instead, it assumes two things about the data: only a small percentage of the data is anomalous, and any anomaly is statistically different from the normal samples. Based on these assumptions, the data is clustered using a similarity measure, and the data points that lie far from every cluster are treated as anomalies.
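The unsupervised approach described above can be sketched as follows. This is a minimal illustration on synthetic data, not the attendance dataset: the two dense groups, the two planted outliers, and the 99th-percentile cutoff are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): two dense groups of "normal"
# points plus two far-off points appended at indices 200 and 201.
normal_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
normal_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
outliers = np.array([[10.0, -5.0], [-6.0, 9.0]])
X = np.vstack([normal_a, normal_b, outliers])

# Cluster the data; Euclidean distance is the similarity measure K-means uses.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the centroid of its assigned cluster.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(X - assigned_centroids, axis=1)

# Flag points beyond the 99th percentile of distances (an assumed cutoff).
threshold = np.percentile(distances, 99)
anomaly_idx = np.flatnonzero(distances > threshold)
print(anomaly_idx)
```

A percentile cutoff encodes the first assumption directly (only a small fraction of the data is anomalous), but it always flags roughly that fraction, even when the data contains no real anomalies. In practice the cutoff would need to be tuned to the dataset.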
We now demonstrate the process of anomaly detection on an attendance dataset using the K-means algorithm.
Importing Python libraries:

    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    import matplotlib.pyplot as plt

    from sklearn.cluster import KMeans                # K-Means clustering
    from sklearn.decomposition import PCA             # Principal Component Analysis
    from sklearn.manifold import TSNE                 # t-distributed Stochastic Neighbor Embedding
    from sklearn.preprocessing import StandardScaler  # used for feature scaling

    import plotly as py
    import plotly.graph_objs as go
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
During pre-processing, we removed holidays from the dataset and label-encoded the data points so that K-means clustering could be applied.
Next, we apply K-means clustering:
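The pre-processing step is not shown in the article, so the following is only a rough sketch of what removing holidays and label-encoding might look like. The sample records, the column names (`employee`, `day`, `status`), and the use of scikit-learn's `LabelEncoder` are all assumptions; only the name `kdf` comes from the code in this article.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical attendance records; the real dataset's columns are not shown.
raw = pd.DataFrame({
    "employee": ["A", "A", "A", "B", "B", "B"],
    "day": ["Mon", "Tue", "Wed", "Mon", "Tue", "Wed"],
    "status": ["Present", "Holiday", "Absent", "Present", "Holiday", "Present"],
})

# Remove holiday rows before clustering.
kdf = raw[raw["status"] != "Holiday"].copy().reset_index(drop=True)

# Label-encode every categorical column into integer codes,
# since K-means needs numeric input.
for column in kdf.columns:
    kdf[column] = LabelEncoder().fit_transform(kdf[column])

print(kdf)
```

Label encoding assigns arbitrary integer codes, so the distances K-means computes between encoded categories carry no real meaning of their own. For non-ordinal columns, one-hot encoding may be a better fit.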
    # Initialize the K-means model with two clusters
    kmeans = KMeans(n_clusters=2)

    # Alternatively, fit and predict the cluster labels in one step:
    # label = kmeans.fit_predict(kdf)

    # Fit the model
    kmeans.fit(kdf)

    # Find which cluster each data point belongs to
    clusters = kmeans.predict(kdf)
    print(clusters)

    # Add the cluster labels to our DataFrame, kdf
    kdf["Cluster"] = clusters

    # PCA with two principal components
    pca_2d = PCA(n_components=2)

    # This DataFrame holds the two principal components used for the
    # 2-D visualization; the "Cluster" column is excluded from the projection
    PCs_2d = pd.DataFrame(pca_2d.fit_transform(kdf.drop(["Cluster"], axis=1)))
    PCs_2d.columns = ["PC1_2d", "PC2_2d"]
Next, we build the 2-D plot:
    # Assumption: the original does not show how cluster0 and cluster1 are
    # built; here the PCA projection is split by each row's K-means label.
    cluster0 = PCs_2d[clusters == 0]
    cluster1 = PCs_2d[clusters == 1]

    # Assumption: placeholder title (the original uses an undefined `title`).
    title = "K-means clusters (2-D PCA projection)"

    # trace1 is for 'Cluster 0'
    trace1 = go.Scatter(
        x=cluster0["PC1_2d"],
        y=cluster0["PC2_2d"],
        mode="markers",
        name="Cluster 0",
        marker=dict(color='rgba(255, 128, 255, 0.8)'),
        text=None)

    # trace2 is for 'Cluster 1'
    trace2 = go.Scatter(
        x=cluster1["PC1_2d"],
        y=cluster1["PC2_2d"],
        mode="markers",
        name="Cluster 1",
        marker=dict(color='rgba(255, 128, 2, 0.8)'),
        text=None)

    data = [trace1, trace2]

    layout = dict(title=title,
                  xaxis=dict(title='PC1', ticklen=5, zeroline=False),
                  yaxis=dict(title='PC2', ticklen=5, zeroline=False))

    fig = dict(data=data, layout=layout)
    iplot(fig)
Data Visualization:
In the resulting plot, some points lie very far from the rest of the points. Those are the anomalies.