Anomaly detection in Machine Learning

🚀 Welcome to TechLearn India, your go-to destination for insightful tech tutorials, coding tips, and all things related to web development! 🌐💻
Anomaly Detection is the technique of identifying rare or abnormal events or observations that differ significantly from the rest of the data. Such “anomalous” behavior typically signals a problem such as credit card fraud, a failing machine in a server room, a cyber attack, etc.
Anomalies can be broadly divided into three categories –
Point Anomaly: A point in a dataset is said to be a Point Anomaly if it is far off from the rest of the data.
Contextual Anomaly: An observation or group of points is a Contextual Anomaly if it is anomalous only in a specific context — for example, high spending during a holiday season may be normal, while the same spending on an ordinary day is not.
Collective Anomaly: A set of data instances that is anomalous as a whole, even though the individual instances may appear normal on their own.
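As a concrete illustration of a point anomaly, a value many standard deviations from the mean of a univariate sample can be flagged with a simple z-score test (a minimal sketch on synthetic data; the threshold of 2 is a common convention, not a rule):

```python
import numpy as np

# Synthetic daily attendance counts; one value (250) is far from the rest
attendance = np.array([95, 102, 98, 100, 97, 250, 101, 99])

# z-score: distance from the mean in units of standard deviation
z = (attendance - attendance.mean()) / attendance.std()

# Flag points more than 2 standard deviations away as point anomalies
anomalies = attendance[np.abs(z) > 2]
print(anomalies)
```

Here only the value 250 exceeds the threshold; the ordinary counts all sit well within one standard deviation of the mean.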
Anomaly detection can be performed with machine learning in the following ways –
Supervised Anomaly Detection: This method requires a labeled dataset containing both normal and anomalous samples to construct a predictive model that classifies future data points. The most commonly used algorithms for this purpose are supervised neural networks, SVMs (support vector machines), KNN (k-nearest neighbors), etc.
Unsupervised Anomaly Detection: This method does not require labeled training data and instead makes two assumptions about the data: only a small percentage of the data is anomalous, and any anomaly is statistically different from the normal samples. Based on these assumptions, the data is clustered using a similarity measure, and the data points that lie far from their cluster are flagged as anomalies.
We now demonstrate the process of anomaly detection on an attendance dataset using the K-means algorithm.
Importing Python libraries:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import matplotlib.pyplot as plt
#sklearn imports
from sklearn.cluster import KMeans #K-Means Clustering
from sklearn.decomposition import PCA #Principal Component Analysis
from sklearn.manifold import TSNE #T-Distributed Stochastic Neighbor Embedding
from sklearn.preprocessing import StandardScaler #used for 'Feature Scaling'
#plotly imports
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
During pre-processing we removed holidays from the dataset and label-encoded the data points so they could be used with K-means clustering.
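The pre-processing step is not shown in full; as a hedged sketch, label encoding the categorical columns with scikit-learn might look like this (the column names and values here are hypothetical, not taken from the original attendance dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical attendance records; the real columns may differ
kdf = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Mon", "Tue"],
    "status": ["present", "absent", "present", "present", "absent"],
})

# Encode each categorical column to integers so K-means can use it
for col in kdf.columns:
    kdf[col] = LabelEncoder().fit_transform(kdf[col])

print(kdf["status"].tolist())  # classes are sorted, so absent -> 0, present -> 1
```

One caveat with this approach: label encoding imposes an arbitrary ordering on categories, which affects the Euclidean distances K-means relies on; one-hot encoding is often a safer choice for nominal features.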
Now we apply K-means clustering –
#initialize the KMeans object with two clusters
kmeans = KMeans(n_clusters=2)
#Fit our model on the pre-processed DataFrame
kmeans.fit(kdf)
#Find which cluster each data point belongs to
clusters = kmeans.predict(kdf)
print(clusters)
#Add the cluster vector to our DataFrame, kdf
kdf["Cluster"] = clusters
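The cluster labels alone do not identify anomalies. One common follow-up — a sketch on synthetic data, not necessarily the exact method used on the attendance dataset — is to measure each point's distance to its assigned centroid and flag the most distant points:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight synthetic clusters plus one obvious outlier at (10, 10)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),
    rng.normal(5, 0.3, size=(50, 2)),
    [[10.0, 10.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives distances to every centroid; take the nearest one
dists = np.min(kmeans.transform(X), axis=1)

# Flag the points whose distance exceeds the 99th percentile
threshold = np.quantile(dists, 0.99)
outlier_idx = np.where(dists > threshold)[0]
print(outlier_idx)
```

The percentile threshold encodes the earlier assumption that only a small fraction of the data is anomalous; it should be tuned to the expected contamination rate.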
#PCA with two principal components
pca_2d = PCA(n_components=2)
#This DataFrame contains the two principal components that will be used
#for the 2-D visualization mentioned above
PCs_2d = pd.DataFrame(pca_2d.fit_transform(kdf.drop(["Cluster"], axis=1)))
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
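Before plotting, it can be worth checking how much of the data's variance the two components actually retain; if the fraction is low, the 2-D picture may be misleading. A quick sketch on synthetic data (the attendance dataset itself is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic 5-feature data with most variance in the first two directions
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])

pca_2d = PCA(n_components=2).fit(X)

# Fraction of total variance captured by the two components
retained = pca_2d.explained_variance_ratio_.sum()
print(round(retained, 3))
```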
Now we build the 2-D plot. The traces below reference per-cluster DataFrames and a plot title, so we first attach the principal components to kdf and split it by cluster:
#attach the principal components (using .values to ignore index alignment)
kdf["PC1_2d"] = PCs_2d["PC1_2d"].values
kdf["PC2_2d"] = PCs_2d["PC2_2d"].values
#split the data by cluster and set a title for the figure
cluster0 = kdf[kdf["Cluster"] == 0]
cluster1 = kdf[kdf["Cluster"] == 1]
title = "Visualizing Clusters in Two Dimensions Using PCA"
#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
x = cluster0["PC1_2d"],
y = cluster0["PC2_2d"],
mode = "markers",
name = "Cluster 0",
marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
text = None)
#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
x = cluster1["PC1_2d"],
y = cluster1["PC2_2d"],
mode = "markers",
name = "Cluster 1",
marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
text = None)
data = [trace1, trace2]
layout = dict(title = title,
xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False)
)
fig = dict(data = data, layout = layout)
iplot(fig)
Data Visualization:

In the resulting scatter plot, a few points lie far away from the rest of the data. Those points are the anomalies.




