
Research
 Similarity/Distance Measures for Discrete Data Structures
 Fast Feature Selection Algorithms
 Large Scale Bid Data Analysis in Local Governments
 User Behavior Analysis from Computer Logs
 Mutation analysis for Influenza Viruses
 Time Series Topic Extraction from Social Media
 Fast computation of Milnor’s Invariant
Research
I am involved in several research projects related to data mining and machine learning including the following topics:
 Similarity/distance measures for discrete data structures,
 Fast feature selection algorithms, and
 Data analysis.
Similarity/Distance Measures for Discrete Data Structures
I design similarity/distance measures for discrete data structures more complex than strings such as trees, ordered sets, and graphs, and apply them to data mining and machine learning methods.
Designing similarity/distance measures
Tree structures play significant roles in a lot of algorithms. In addition, many entities are expressed in the form of trees such as parse trees, phylogenetic trees and glycan primary structures.
Given two tree structures, how much similar or dissimilar are they? The measure of the similarity or dissimilarity is not uniquely determined.
The most coarse measure is a binary indicator of isomorphism between two structures. There is a lineartime algorithm for tree isomorphism. In contract, we have the following difficulties for designing flexible matching (or mismatchtolerant) measures.
 Various measures are considered according to the applications and the computational efficiency
 The computation of these measures is, in general, intractable. For example, the edit distance problem for unordered trees due to Tai[1979] is known to be NPhard.
The edit distance is one of the most popular measures for the distance between two structures. It is defined by the minimum cost of edit operations to transform one structure to the other.
Although a lot of measures based on tree edit distance have been proposed so far, they are not wellorganized. In fact, it has been unknown that some are equivalent, and others are wrongly defined before we reveal them. In addition, we discover a neat class hierarchy among the existing various distance measures for trees according to their flexibility of matching and computational complexity.
Also, fundamental and basic properties in comparing trees have been unknown yet. For example, given nodetonode correspondences between two trees as shown in the dotted lines in the right figure, can we amalgamate two trees into a supertree by overlapping these corresponding nodes preserving the hierarchical order in each tree?
In the right example, we end up with the directed acyclic graph with a confluent red node, and fail to obtain a supertree. We first show the necessary and sufficient condition of nodetonode correspondences for amalgamating two trees into a supertree by ordered algebra.
We address the design problem of similarity/distance measures for discrete data structures mainly by the following two approaches.
 Optimization approach
 The algorithms for finding the maximum common substructure or the minimum symmetric difference of two structures (a.k.a edit distance) are addressed. This approach results in not only the distance or similarity values, but also the similar substructures showing how similar they are. (In the figures, the substructures are given in the form of dotted lines.)
 Counting approach
 The algorithms for counting common substructures are addressed. This approach factors the local similarity into the total similarity computation, and is not necessarily to be enumerative counting for efficiency. Above all, it is directly applicable to design kernel functions for statistical learning.
These theoretical contributions are applied to realworld data analysis such as “a classification problem of cancerinduced aberrant glycan structures”, and “an analysis of the hierarchical structures of time series communities”.
Collaborative research groups

Prof. Shin’s Lab,
Hyogo University:
 A unified theory of tree edit distance measures, and applications to tree kernels
 A general kernel design framework for discrete data structures

Prof. Hirata’s Lab,
Kyushu Institute of Technology:
 Algorithm complexity analysis of tree edit distance and their classification.
 Prof. Miyahara’s Lab, Hiroshima City University: Tree pattern extraction based on genetic programming
 Prof. Yamamoto’s Lab, Kyoto University: New distance measures between trees and their efficient algorithms
Fast Feature Selection Algorithms
Feature selection is a method for narrowing down a whole set of data features to a subset essential to applications such as classification, pattern recognition, and term extraction. It is a fundamental problem in machine learning and data mining.
We focus on one of the fastest feature selection algorithms CWC due to Shin[2011]. CWC is basically designed for categorical data, which models the feature selection problem as a set cover problem with a greedy search. (The right figure depicts the process.)
From the theoretical point of view, we have yet to analyze why CWC outperforms most existing algorithms of categorical feature selection. Also, we address the improvement of CWC in the search strategy and feature measurements.
Also, CWC is applied to term extractions from the vast amount of documents.
Software
 sCWC: very fast feature selection for sparse nomnal data (implemented in Scala)
Collaborative research groups

Prof. Shin’s Lab,
Hyogo University: CWC was first proposed by Prof. Shin in 2011.
Large Scale Bid Data Analysis in Local Governments
To be described…
Collaborative research groups
 Prof. Fukumoto’s Lab, Gakushuin University: We are now developing a large scale bid database of local governments in Japan.

Prof. Yamamoto’s Lab,
Kyoto University:
Time series community detection by matrix decomposition.
User Behavior Analysis from Computer Logs
To be described…
Mutation analysis for Influenza Viruses
To be described…
Collaborative research group

Prof. Ito’s Lab
Hokkaido University
Time Series Topic Extraction from Social Media
To be described…
Collaborative research group
 Prof. Hashimoto’s Lab, Chiba University of Commerce
Fast computation of Milnor’s Invariant
To be described…
Collaborative research groups
 Prof. Yasuhara’s Lab, Tokyo Gakugei University
 Prof. Sakamoto’s Lab, Kyushu Institute of Technology