Detecting outliers and annotating their types with indexing structures

Silva, Guilherme Domingos Faria

doi:10.11606/D.55.2022.tde-15092022-141353

Master's Dissertation

DOI

https://doi.org/10.11606/D.55.2022.tde-15092022-141353

Document

Master's Dissertation

Author

Silva, Guilherme Domingos Faria (Catálogo USP)

Full name

Guilherme Domingos Faria Silva

Institute/School/College

Instituto de Ciências Matemáticas e de Computação

Knowledge Area

Computer Science and Computational Mathematics

Date of Defense

2022-07-15

Published

São Carlos, 2022

Supervisor

Cordeiro, Robson Leonardo Ferreira (Catálogo USP)

Committee

Cordeiro, Robson Leonardo Ferreira (President)
Moro, Mirella Moura
Razente, Humberto Luiz
Traina Junior, Caetano

Title in English

Detecting outliers and annotating their types with indexing structures

Keywords in English

Anomalies
Indexing structures
Outlier annotation
Outlier detection
Slim-tree

Abstract in English

The constant growth in the amount of data available on the internet is accelerated by the popularization of technologies such as 5G and the Internet of Things. Large datasets usually contain many outliers that either go undetected or are simply discarded. The outlier detection literature shows that investigating these singular instances can provide new insights into the behavior of systems and people. Such inspection allows diseases to be identified early, financial market trends to be better interpreted, and cybersecurity attacks to be prevented. However, existing outlier detection techniques have limitations: (1) they depend on the availability of the instances' features, which can raise privacy issues; (2) they scale poorly; and (3) they provide only a binary separation that detects outliers but does not classify them so that they can be better understood. Starting from an unlabeled dataset for which only the distances between instances are available, how can outliers be detected and categorized by type efficiently? To the best of our knowledge, no work in the vast outlier detection literature addresses the problem of annotating outliers. Outliers can be classified into three broad groups: (a) global outliers, instances that differ severely from all others in the dataset, such as errors made when inserting information into a database; (b) local outliers, instances that resemble the dataset as a whole but show small variations that set them apart within a narrower scope, for instance, a football player who makes many mistakes while playing in a strong team; and (c) collective outliers, small groups of instances that are, together, quite different from the rest, such as a denial-of-service cyberattack in which a few machines show similar harmful behavior.
In this project we introduce C-ALLOUT, a new method that detects outliers and also categorizes them by type. C-ALLOUT matches or even outperforms state-of-the-art algorithms while additionally annotating outliers, a task that competitors cannot perform. C-ALLOUT is based on the Slim-tree, an indexing structure that makes it scalable, with O(n log n) time and space complexity. The proposed method handles both scenarios: features available, or access limited to distances only. Finally, C-ALLOUT requires no interaction with the user and is parameter-free by default, which is ideal for unsupervised tasks such as outlier analysis.
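
As a rough illustration of the distance-only setting and of the difference between global and collective outliers, the sketch below scores instances by the distance to their k-th nearest neighbour. This is not C-ALLOUT and it does not use a Slim-tree: it is a generic brute-force baseline that costs O(n²) on a full distance matrix. The function name `knn_distance_scores` and the toy data are assumptions made only for this example.

```python
import numpy as np

def knn_distance_scores(dist, k):
    """Score each instance by the distance to its k-th nearest neighbour.

    `dist` is a symmetric (n, n) matrix of pairwise distances, so no
    feature vectors are needed. Larger scores mean more isolated instances.
    """
    dist = np.asarray(dist, dtype=float)
    # Column 0 of each sorted row is the zero self-distance, so column k
    # holds the distance to the k-th nearest other instance.
    return np.sort(dist, axis=1)[:, k]

# Hypothetical toy data: a dense group (indices 0-5), a tight pair far
# from it (indices 6-7), and one isolated instance (index 8).
points = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 51.0, 200.0])
dist = np.abs(points[:, None] - points[None, :])

scores_k1 = knn_distance_scores(dist, k=1)
scores_k2 = knn_distance_scores(dist, k=2)

# The isolated instance stands out even with k=1 (global outlier).
top_k1 = int(np.argmax(scores_k1))

# The tight pair hides behind its own partner when k=1, and only stands
# out once k reaches the group size (collective outlier).
pair_flagged_k1 = bool(scores_k1[6] > scores_k1[:6].max())
pair_flagged_k2 = bool(scores_k2[6] > scores_k2[:6].max())
```

The toy case suggests why a single binary score is not enough to tell outlier types apart: the same instance can look normal or anomalous depending on the neighbourhood size, and that is the kind of distinction the annotation task targets.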

Title in Portuguese

Detectando anomalias e anotando seus tipos com estruturas de indexação

Keywords in Portuguese

Anotação de anomalias
Casos de exceção
Detecção de anomalias
Estruturas de indexação
Slim-tree

Abstract in Portuguese

O aumento na quantidade de dados disponíveis na internet se acentua com a popularização de tecnologias como 5G e Internet das Coisas. Em grandes bases de dados costuma haver forte presença de anomalias não detectadas ou apenas descartadas. A literatura de detecção de anomalias demonstra que a investigação dessas instâncias singulares pode fornecer novas perspectivas sobre os comportamentos de sistemas e pessoas. Essa inspeção permite que doenças sejam identificadas precocemente, tendências do mercado financeiro sejam melhor interpretadas e que ataques de segurança digital sejam impedidos. Contudo, as técnicas de detecção de anomalias carregam limitações, sendo: (1) dependentes da disponibilidade dos atributos das instâncias, o que pode gerar problemas de privacidade; (2) pouco escaláveis e; (3) capazes de prover apenas uma separação binária que permite detectar anomalias, mas não classificá-las para que sejam melhor entendidas. Partindo de um conjunto de dados não rotulados para o qual apenas distâncias entre as instâncias estão disponíveis, como detectar anomalias e categorizá-las por tipo de forma eficiente? Na vasta literatura de detecção de anomalias não há trabalhos, até onde sabemos, que lidam com o problema de anotação de anomalias. Anomalias podem ser classificadas em três grandes grupos: (a) anomalias globais, instâncias severamente diferentes das outras, como erros de inserção de informações em uma base de dados; (b) anomalias locais, instâncias parecidas com algumas outras da base de dados como um todo, mas com variações mínimas que as tornam diferentes em um escopo menor, como um jogador de futebol que acerta poucas jogadas jogando em um time forte e; (c) anomalias coletivas, pequenos grupos de instâncias que são, em conjunto, bastante diferentes das restantes, como um ataque cibernético por negação de serviço, com poucas máquinas tendo um comportamento nocivo semelhante. 
Neste projeto é apresentado o C-ALLOUT: um novo método para detecção e anotação de anomalias. C-ALLOUT é capaz de se manter em nível de igualdade, ou até superioridade, quando comparado aos algoritmos do estado da arte, ainda contribuindo com a anotação das anomalias, uma tarefa que os competidores não são capazes de realizar. C-ALLOUT toma proveito da Slim-tree, uma estrutura de indexação que o torna escalável, atingindo complexidade de tempo e espaço O(nlogn). O método funciona tendo acesso aos atributos das instâncias ou limitado a distâncias. Por fim, C-ALLOUT não depende de nenhuma interação com o usuário, sendo livre de parâmetros por padrão, o ideal para tarefas não-supervisionadas como análise de anomalias.

WARNING - Viewing this document is conditioned on your acceptance of the following terms of use:
This document is only for private use for research and teaching activities. Reproduction for commercial use is forbidden. These rights cover all data about this document as well as its contents. Any use or copy of this document, in whole or in part, must include the author's name.

GuilhermeDomingosFariaSilva.pdf (3.79 Mbytes)

Publishing Date

2022-09-15
