Otimização e análise das máquinas de vetores de suporte aplicadas à classificação de documentos.

Kinto, Eduardo Akira

doi:10.11606/T.3.2011.tde-04112011-151337

Home

Facilities

Doctoral Thesis

DOI

https://doi.org/10.11606/T.3.2011.tde-04112011-151337

Document

Doctoral Thesis

Author

Kinto, Eduardo Akira (Catálogo USP)

Full name

Eduardo Akira Kinto

E-mail

Institute/School/College

Escola Politécnica

Knowledge Area

Electronic Systems

Date of Defense

2011-06-17

Published

São Paulo, 2011

Supervisor

Del Moral Hernandez, Emilio (Catálogo USP)

Committee

Del Moral Hernandez, Emilio (President)
Almeida Junior, Jorge Rady de
Dória Neto, Adrião Duarte
Reis Filho, Francisco Antonio
Silva, Flávio Soares Corrêa da

Title in Portuguese

Otimização e análise das máquinas de vetores de suporte aplicadas à classificação de documentos.

Keywords in Portuguese

Aprendizado computacional
Inteligência artificial
Recuperação da informação
Redes neurais

Abstract in Portuguese

A análise das informações armazenadas é fundamental para qualquer tomada de decisão, mas para isso ela deve estar organizada e permitir fácil acesso. Quando temos um volume de dados muito grande, esta tarefa torna-se muito mais complicada do ponto de vista computacional. É fundamental, então, haver mecanismos eficientes para análise das informações. As Redes Neurais Artificiais (RNA), as Máquinas de Vetores-Suporte (Support Vector Machine - SVM) e outros algoritmos são frequentemente usados para esta finalidade. Neste trabalho, iremos explorar o SMO (Sequential Minimal Optimization) e alterá-lo, com a finalidade de atingir um tempo de treinamento menor, mas, ao mesmo tempo manter a capacidade de classificação. São duas as alterações propostas, uma, no seu algoritmo de treinamento e outra, na sua arquitetura. A primeira modificação do SMO proposta neste trabalho é permitir a atualização de candidatos ao vetor suporte no mesmo ciclo de atualização de um coeficiente de Lagrange. Dos algoritmos que codificam o SVM, o SMO é um dos mais rápidos e um dos que menos consome memória. A complexidade computacional do SMO é menor com relação aos demais algoritmos porque ele não trabalha com inversão de uma matriz de kernel. Esta matriz, que é quadrada, costuma ter um tamanho proporcional ao número de amostras que compõem os chamados vetores-suporte. A segunda proposta para diminuir o tempo de treinamento do SVM consiste na subdivisão ordenada do conjunto de treinamento, utilizando-se a dimensão de maior entropia. Esta subdivisão difere das abordagens tradicionais pelo fato de as amostras não serem constantemente submetidas repetidas vezes ao treinamento do SVM. Finalmente, é aplicado o SMO proposto para classificação de documentos ou textos por meio de uma abordagem nova, a classificação de uma-classe usando classificadores binários. Como toda classificação de documentos, a análise dos atributos é uma etapa fundamental, e aqui uma nova contribuição é apresentada. Utilizamos a correlação total ponto a ponto para seleção das palavras que formam o vetor de índices de palavras.

Title in English

Optimization and analysis of support vector machine applied to text classification.

Keywords in English

Artificial intelligence
Artificial neural network
Information retrieval
Machine learning
Support vector machine
Text classification

Abstract in English

Stored data analysis is very important when taking a decision in every business, but to accomplish this task data must be organized in a way it can be easily accessed. When we have a huge amount of information, data analysis becomes a very computational hard job. So, it is essential to have an efficient mechanism for information analysis. Artificial neural networks (ANN), support vector machine (SVM) and other algorithms are frequently used for information analysis, and also in huge volume information analysis. In this work we will explore the sequential minimal optimization (SMO) algorithm, a learning algorithm for the SVM. We will modify it aiming for a lower training time and also to maintaining its classification generalization capacity. Two modifications are proposed to the SMO, one in the training algorithm and another in its architecture. The first modification to the SMO enables more than one Lagrange coefficient update by choosing the neighbor samples of the updating pair (current working set). From many options of SVM implementation, SMO was chosen because it is one of the fastest and less memory consuming one. The computational complexity of the SMO is lower than other types of SVM because it does not require handling a huge Kernel matrix. Matrix inversion is one of the most time consuming step of SVM, and its size is as bigger as the number of support vectors of the sample set. The second modification to the SMO proposes the creation of an ordered subset using as a reference one of the dimensions; entropy measure is used to choose the dimension. This subset creation is different from other division based SVM architectures because samples are not used in more than one training pair set. All this improved SVM is used on a one-class like classification task of documents. Every document classification problem needs a good feature vector (feature selection and dimensionality reduction); we propose in this work a novel feature indexing mechanism using the pointwise total correlation.

WARNING - Viewing this document is conditioned on your acceptance of the following terms of use:
This document is only for private use for research and teaching activities. Reproduction for commercial use is forbidden. This rights cover the whole data about this document as well as its contents. Any uses or copies of this document in whole or in part must include the author's name.

Eduardo_Kinto_Final_PosDefesa.pdf (1.58 Mbytes)

Publishing Date

2011-11-21

Derived works

WARNING: Learn what derived works are clicking here.