Research

I am involved in several research projects related to data mining and machine learning including the following topics:

Similarity/Distance Measures for Discrete Data Structures

I design similarity/distance measures for discrete data structures more complex than strings such as trees, ordered sets, and graphs, and apply them to data mining and machine learning methods.

Designing similarity/distance measures

Tree structures play significant roles in a lot of algorithms. In addition, many entities are expressed in the form of trees such as parse trees, phylogenetic trees and glycan primary structures.

Given two tree structures, how much similar or dissimilar are they? The measure of the similarity or dissimilarity is not uniquely determined.

tree mapping
A mapping between two trees
(obtained by an edit distance algorithm)

The most coarse measure is a binary indicator of isomorphism between two structures. There is a linear-time algorithm for tree isomorphism. In contract, we have the following difficulties for designing flexible matching (or mismatch-tolerant) measures.

The edit distance is one of the most popular measures for the distance between two structures. It is defined by the minimum cost of edit operations to transform one structure to the other.

Although a lot of measures based on tree edit distance have been proposed so far, they are not well-organized. In fact, it has been unknown that some are equivalent, and others are wrongly defined before we reveal them. In addition, we discover a neat class hierarchy among the existing various distance measures for trees according to their flexibility of matching and computational complexity.

Amalgamation of two trees

Also, fundamental and basic properties in comparing trees have been unknown yet. For example, given node-to-node correspondences between two trees as shown in the dotted lines in the right figure, can we amalgamate two trees into a supertree by overlapping these corresponding nodes preserving the hierarchical order in each tree?

In the right example, we end up with the directed acyclic graph with a confluent red node, and fail to obtain a supertree. We first show the necessary and sufficient condition of node-to-node correspondences for amalgamating two trees into a supertree by ordered algebra.

We address the design problem of similarity/distance measures for discrete data structures mainly by the following two approaches.

Optimization approach
The algorithms for finding the maximum common substructure or the minimum symmetric difference of two structures (a.k.a edit distance) are addressed. This approach results in not only the distance or similarity values, but also the similar substructures showing how similar they are. (In the figures, the substructures are given in the form of dotted lines.)
Counting approach
The algorithms for counting common substructures are addressed. This approach factors the local similarity into the total similarity computation, and is not necessarily to be enumerative counting for efficiency. Above all, it is directly applicable to design kernel functions for statistical learning.

These theoretical contributions are applied to real-world data analysis such as “a classification problem of cancer-induced aberrant glycan structures”, and “an analysis of the hierarchical structures of time series communities”.

Collaborative research groups

Fast Feature Selection Algorithms

Fast feature selection algorithm

Feature selection is a method for narrowing down a whole set of data features to a subset essential to applications such as classification, pattern recognition, and term extraction. It is a fundamental problem in machine learning and data mining.

We focus on one of the fastest feature selection algorithms CWC due to Shin[2011]. CWC is basically designed for categorical data, which models the feature selection problem as a set cover problem with a greedy search. (The right figure depicts the process.)

From the theoretical point of view, we have yet to analyze why CWC outperforms most existing algorithms of categorical feature selection. Also, we address the improvement of CWC in the search strategy and feature measurements.

Also, CWC is applied to term extractions from the vast amount of documents.

Software

Collaborative research groups

Large Scale Bid Data Analysis in Local Governments

Community structure formed by bidder
companies in a prefecture in Japan

To be described…



Collaborative research groups

User Behavior Analysis from Computer Logs

To be described…

Mutation analysis for Influenza Viruses

Virus gene mutations in Hamming distance
space embedded by sparse PCA

To be described…

Collaborative research group

Time Series Topic Extraction from Social Media

To be described…

Collaborative research group

To be described…