-
Research
- Similarity/Distance Measures for Discrete Data Structures
- Fast Feature Selection Algorithms
- Large Scale Bid Data Analysis in Local Governments
- User Behavior Analysis from Computer Logs
- Mutation analysis for Influenza Viruses
- Time Series Topic Extraction from Social Media
- Fast computation of Milnor’s Invariant
Research
I am involved in several research projects related to data mining and machine learning including the following topics:
- Similarity/distance measures for discrete data structures,
- Fast feature selection algorithms, and
- Data analysis.
Similarity/Distance Measures for Discrete Data Structures
I design similarity/distance measures for discrete data structures more complex than strings such as trees, ordered sets, and graphs, and apply them to data mining and machine learning methods.
Designing similarity/distance measures
Tree structures play significant roles in a lot of algorithms. In addition, many entities are expressed in the form of trees such as parse trees, phylogenetic trees and glycan primary structures.
Given two tree structures, how much similar or dissimilar are they? The measure of the similarity or dissimilarity is not uniquely determined.
The most coarse measure is a binary indicator of isomorphism between two structures. There is a linear-time algorithm for tree isomorphism. In contract, we have the following difficulties for designing flexible matching (or mismatch-tolerant) measures.
- Various measures are considered according to the applications and the computational efficiency
- The computation of these measures is, in general, intractable. For example, the edit distance problem for unordered trees due to Tai[1979] is known to be NP-hard.
The edit distance is one of the most popular measures for the distance between two structures. It is defined by the minimum cost of edit operations to transform one structure to the other.
Although a lot of measures based on tree edit distance have been proposed so far, they are not well-organized. In fact, it has been unknown that some are equivalent, and others are wrongly defined before we reveal them. In addition, we discover a neat class hierarchy among the existing various distance measures for trees according to their flexibility of matching and computational complexity.
Also, fundamental and basic properties in comparing trees have been unknown yet. For example, given node-to-node correspondences between two trees as shown in the dotted lines in the right figure, can we amalgamate two trees into a supertree by overlapping these corresponding nodes preserving the hierarchical order in each tree?
In the right example, we end up with the directed acyclic graph with a confluent red node, and fail to obtain a supertree. We first show the necessary and sufficient condition of node-to-node correspondences for amalgamating two trees into a supertree by ordered algebra.
We address the design problem of similarity/distance measures for discrete data structures mainly by the following two approaches.
- Optimization approach
- The algorithms for finding the maximum common substructure or the minimum symmetric difference of two structures (a.k.a edit distance) are addressed. This approach results in not only the distance or similarity values, but also the similar substructures showing how similar they are. (In the figures, the substructures are given in the form of dotted lines.)
- Counting approach
- The algorithms for counting common substructures are addressed. This approach factors the local similarity into the total similarity computation, and is not necessarily to be enumerative counting for efficiency. Above all, it is directly applicable to design kernel functions for statistical learning.
These theoretical contributions are applied to real-world data analysis such as “a classification problem of cancer-induced aberrant glycan structures”, and “an analysis of the hierarchical structures of time series communities”.
Collaborative research groups
-
Prof. Shin’s Lab,
Hyogo University:
- A unified theory of tree edit distance measures, and applications to tree kernels
- A general kernel design framework for discrete data structures
-
Prof. Hirata’s Lab,
Kyushu Institute of Technology:
- Algorithm complexity analysis of tree edit distance and their classification.
- Prof. Miyahara’s Lab, Hiroshima City University: Tree pattern extraction based on genetic programming
- Prof. Yamamoto’s Lab, Kyoto University: New distance measures between trees and their efficient algorithms
Fast Feature Selection Algorithms
Feature selection is a method for narrowing down a whole set of data features to a subset essential to applications such as classification, pattern recognition, and term extraction. It is a fundamental problem in machine learning and data mining.
We focus on one of the fastest feature selection algorithms CWC due to Shin[2011]. CWC is basically designed for categorical data, which models the feature selection problem as a set cover problem with a greedy search. (The right figure depicts the process.)
From the theoretical point of view, we have yet to analyze why CWC outperforms most existing algorithms of categorical feature selection. Also, we address the improvement of CWC in the search strategy and feature measurements.
Also, CWC is applied to term extractions from the vast amount of documents.
Software
- sCWC: very fast feature selection for sparse nomnal data (implemented in Scala)
Collaborative research groups
-
Prof. Shin’s Lab,
Hyogo University: CWC was first proposed by Prof. Shin in 2011.
Large Scale Bid Data Analysis in Local Governments
To be described…
Collaborative research groups
- Prof. Fukumoto’s Lab, Gakushuin University: We are now developing a large scale bid database of local governments in Japan.
-
Prof. Yamamoto’s Lab,
Kyoto University:
Time series community detection by matrix decomposition.
User Behavior Analysis from Computer Logs
To be described…
Mutation analysis for Influenza Viruses
To be described…
Collaborative research group
-
Prof. Ito’s Lab
Hokkaido University
Time Series Topic Extraction from Social Media
To be described…
Collaborative research group
- Prof. Hashimoto’s Lab, Chiba University of Commerce
Fast computation of Milnor’s Invariant
To be described…
Collaborative research groups
- Prof. Yasuhara’s Lab, Tokyo Gakugei University
- Prof. Sakamoto’s Lab, Kyushu Institute of Technology