**Anomaly Detection** is the technique of identifying rare or abnormal events or observations that differ significantly from the rest of the data. Such “anomalous” behavior typically signals a problem such as credit card fraud, a failing server, or a cyber attack.

Anomalies fall broadly into three categories –

- **Point Anomaly:** A single point is a Point Anomaly if it lies far from the rest of the data.
- **Contextual Anomaly:** An observation (or group of observations) is a Contextual Anomaly if it is anomalous only in a specific context – for example, a temperature reading that is normal in summer but abnormal in winter.
- **Collective Anomaly:** A set of related data instances that is anomalous as a whole, even if the individual instances are not anomalous on their own.
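As a minimal sketch of a point anomaly, consider flagging values that lie far from the mean of an otherwise well-behaved series (the data here is made up for illustration):

```python
import numpy as np

# Hypothetical daily transaction amounts; 9000 is a point anomaly
amounts = np.array([40, 55, 38, 47, 52, 9000, 45, 50])

# Flag values more than 2 standard deviations from the mean (z-score test)
z = np.abs(amounts - amounts.mean()) / amounts.std()
anomalies = amounts[z > 2]
print(anomalies)
```

Note that a single extreme value inflates the standard deviation itself, which is why a modest threshold (2 rather than 3) is used here.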

Anomaly detection can be performed with **Machine Learning** in the following ways –

- **Supervised Anomaly Detection:** This method requires a labeled dataset containing both normal and anomalous samples, from which a predictive model is built to classify future data points. Commonly used algorithms for this purpose include supervised Neural Networks, SVM (Support Vector Machine), and KNN (K-Nearest Neighbors).
- **Unsupervised Anomaly Detection:** This method does not require labels. Instead, it makes two assumptions about the data: only a small percentage of the data is anomalous, and anomalies are statistically different from normal samples. Based on these assumptions, the data is clustered using a similarity measure, and points that lie far from any cluster are treated as anomalies.
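To make the supervised case concrete, here is a minimal sketch using KNN on made-up labeled data (the values and labels are assumptions for illustration, not from the attendance dataset used below):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled history: 0 = normal, 1 = anomalous
X_train = np.array([[50.0], [52.0], [48.0], [51.0], [49.0], [500.0], [480.0]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1])

# Fit a KNN classifier on the labeled samples
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Classify new observations against the labeled history
print(clf.predict([[50.5], [490.0]]))  # -> [0 1]
```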

We now demonstrate the process of anomaly detection on an attendance dataset using the K-means algorithm.

`Importing Python libraries:`

```
import numpy as np
import pandas as pd
from sklearn import preprocessing
import matplotlib.pyplot as plt
```

`#sklearn imports`

```
from sklearn.cluster import KMeans #K-Means Clustering
from sklearn.decomposition import PCA #Principal Component Analysis
from sklearn.manifold import TSNE #T-Distributed Stochastic Neighbor Embedding
from sklearn.preprocessing import StandardScaler #used for 'Feature Scaling'
```

`#plotly imports`

```
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
```

After pre-processing, we removed holidays from the dataset (so they are not flagged as anomalies) and label-encoded the categorical columns so that K-means clustering can be applied. We call the resulting DataFrame `kdf`.
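The pre-processing step can be sketched as follows. The column names and values here are assumptions, since the actual attendance dataset is not shown:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical attendance records; column names are assumptions
df = pd.DataFrame({
    "date":   ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-26"],
    "status": ["present", "absent", "present", "holiday"],
    "hours":  [8.0, 0.0, 7.5, 0.0],
})

# Remove holidays so they do not register as anomalies
df = df[df["status"] != "holiday"].copy()

# Label-encode the categorical column into integers for clustering
df["status"] = LabelEncoder().fit_transform(df["status"])

# Keep only the numeric columns for K-means
kdf = df.drop("date", axis=1)
print(kdf)
```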

Now we apply K-means clustering –

```
#initialize the KMeans object with two clusters
kmeans = KMeans(n_clusters=2)

#fit the model to the pre-processed data
kmeans.fit(kdf)

#find which cluster each data point belongs to
clusters = kmeans.predict(kdf)
print(clusters)

#add the cluster vector to our DataFrame, kdf
kdf["Cluster"] = clusters

#PCA with two principal components; this DataFrame holds the
#components that will be used for the 2-D visualization below
pca_2d = PCA(n_components=2)
PCs_2d = pd.DataFrame(pca_2d.fit_transform(kdf.drop(["Cluster"], axis=1)))
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
```
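The cluster labels alone do not flag anomalies; a common follow-up is to score each point by its distance to the nearest cluster centre (available via `kmeans.transform`) and flag the farthest points. A minimal sketch on synthetic data standing in for `kdf`:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for kdf: two dense groups plus one far-off point
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               [[20.0, 20.0]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its nearest cluster centre
dist = kmeans.transform(X).min(axis=1)

# Flag points whose distance is unusually large (the threshold is a judgment call)
threshold = dist.mean() + 3 * dist.std()
print(np.where(dist > threshold)[0])
```

Points whose distance exceeds the threshold are exactly the “far off from the cluster” observations the unsupervised assumptions describe.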

Now we build the 2-D plot –

```
#join the principal components with the cluster labels, then
#split the rows by cluster for plotting
PCs_2d["Cluster"] = kdf["Cluster"].values
cluster0 = PCs_2d[PCs_2d["Cluster"] == 0]
cluster1 = PCs_2d[PCs_2d["Cluster"] == 1]

#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
    x = cluster0["PC1_2d"],
    y = cluster0["PC2_2d"],
    mode = "markers",
    name = "Cluster 0",
    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
    x = cluster1["PC1_2d"],
    y = cluster1["PC2_2d"],
    mode = "markers",
    name = "Cluster 1",
    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
    text = None)

data = [trace1, trace2]
layout = dict(title = 'Clusters in PCA space',
    xaxis = dict(title = 'PC1', ticklen = 5, zeroline = False),
    yaxis = dict(title = 'PC2', ticklen = 5, zeroline = False))
fig = dict(data = data, layout = layout)
iplot(fig)
```

Data Visualization:

In the resulting plot, a few points lie far away from the two main clusters. Those far-off points are the anomalies.