Special Issue Article
Anomaly Detection on Data Streams with High Dimensional Data Environment
Classification consists of assigning a class label to each case in a set of unclassified cases. Supervised and unsupervised classification methods are used to assign class labels. Classification is performed in two steps: learning or training (model construction) and testing (model usage). The learning process identifies class patterns from labeled transactions. In the testing phase, unlabeled transactions are assigned class values with reference to the learned class patterns. An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Distance-based outlier detection methods identify records that differ from the rest of the data set. Anomaly detection is also referred to as outlier detection. Batch-mode anomaly detection is not suitable for large-scale data because it requires high computational and memory resources. Principal component analysis (PCA) is an unsupervised dimension reduction method that determines the principal directions of the data distribution. The online oversampling principal component analysis (osPCA) algorithm detects outliers in large amounts of data in an online manner. The osPCA method is enhanced to handle high-dimensional data. The learning process is improved to manage dimensionality differences, the system is tuned to handle data with multi-cluster structure, and anomaly detection is extended to streaming data.
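The oversampling idea behind osPCA can be illustrated with a minimal sketch. It rests on assumptions not stated in this abstract: each target instance is duplicated a fixed fraction of the data set size, the dominant principal direction is recomputed, and the anomaly score is one minus the absolute cosine similarity between the original and the perturbed direction. The function names and the `ratio` parameter are hypothetical, and the sketch recomputes a full eigendecomposition per instance for clarity, so it does not include the online update that makes osPCA efficient.

```python
import numpy as np


def principal_direction(X):
    """Return the unit-length dominant eigenvector of the covariance of X."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the last column is the dominant principal direction.
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]


def oversampling_pca_scores(X, ratio=0.1):
    """Score each row of X by how much duplicating it rotates the
    dominant principal direction (assumed scoring rule: 1 - |cosine|)."""
    n = X.shape[0]
    n_dup = max(1, int(round(ratio * n)))  # hypothetical oversampling ratio
    u = principal_direction(X)
    scores = np.empty(n)
    for i in range(n):
        # Oversample the target instance by appending n_dup copies of it.
        X_over = np.vstack([X, np.repeat(X[i:i + 1], n_dup, axis=0)])
        u_i = principal_direction(X_over)
        # abs() removes the sign ambiguity of eigenvectors.
        scores[i] = 1.0 - abs(float(u @ u_i))
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic inliers stretched along the first axis, plus one planted
    # off-axis point appended as the last row.
    inliers = rng.normal(size=(200, 2)) * np.array([3.0, 0.3])
    X = np.vstack([inliers, [[6.0, 10.0]]])
    scores = oversampling_pca_scores(X, ratio=0.1)
    print("index with highest score:", int(np.argmax(scores)))
```

Duplicating an ordinary instance barely moves the principal direction, while duplicating an instance far from the main data trend rotates it noticeably, which is why the score separates the two. The cost of this naive version grows with both the number of instances and the dimensionality, which is the batch-mode burden that an online formulation is meant to avoid.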