Research Track Papers

Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

Yu Meng, Yunyi Zhang, Jiaxin Huang, Yu Zhang, Chao Zhang, Jiawei Han

Paper Link

Video Link

Mining a set of meaningful topics organized into a hierarchy is intuitively appealing since topic correlations are ubiquitous in massive text corpora. To account for potential hierarchical topic structures, hierarchical topic models generalize flat topic models by incorporating latent topic hierarchies into their generative modeling process. However, due to their purely unsupervised nature, the learned topic hierarchy often deviates from users' particular needs or interests. To guide the hierarchical topic discovery process with minimal user supervision, we propose a new task, Hierarchical Topic Mining, which takes a category tree described by category names only and aims to mine a set of representative terms for each category from a text corpus, helping a user comprehend his/her topics of interest. We develop a novel joint tree and text embedding method, along with a principled optimization procedure, that allows simultaneous modeling of the category tree structure and the corpus generative process in the spherical space for effective category-representative term discovery. Our comprehensive experiments show that our model, named JoSH, mines a high-quality set of hierarchical topics with high efficiency and benefits weakly-supervised hierarchical text classification tasks.
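As a minimal sketch of the term-discovery step described above: once category and word embeddings have been jointly trained and L2-normalized onto the unit sphere, each category's representative terms are simply the words with the highest directional (cosine) similarity to it. All names here (mine_representative_terms, category_vecs, word_vecs) are illustrative, not taken from the JoSH codebase.

```python
import numpy as np

def mine_representative_terms(category_vecs, word_vecs, vocab, top_k=5):
    """category_vecs: (C, d), word_vecs: (V, d), both rows unit-normalized."""
    # On the unit sphere, cosine similarity reduces to a dot product.
    sim = category_vecs @ word_vecs.T          # (C, V) similarity matrix
    top = np.argsort(-sim, axis=1)[:, :top_k]  # highest-similarity words per category
    return [[vocab[j] for j in row] for row in top]

# Toy usage with random unit vectors standing in for trained embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4)); W /= np.linalg.norm(W, axis=1, keepdims=True)
C = rng.normal(size=(2, 4)); C /= np.linalg.norm(C, axis=1, keepdims=True)
vocab = [f"term{i}" for i in range(8)]
print(mine_representative_terms(C, W, vocab, top_k=3))
```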


STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths

Yue Yu, Yinghao Li, Jiaming Shen, Hao Feng, Jimeng Sun, Chao Zhang

Illustration of the taxonomy expansion problem. Given an existing taxonomy, the task is to insert new concept terms (e.g., EMI, stratospheric pollutant, economic noise, carcinogenic substance) into the correct positions in the existing taxonomy.

Paper Link

Video Link

Taxonomies are important knowledge ontologies that underpin numerous everyday applications, including product navigation, web search, and recommendation. However, many taxonomies used in practice suffer from low coverage. We study the taxonomy expansion problem, which aims to expand existing taxonomies with new concept terms. We propose a self-supervised taxonomy expansion model named STEAM, which leverages natural supervision in the existing taxonomy for expansion. To generate natural self-supervision signals, our method samples mini-paths from the existing taxonomy and formulates a node attachment prediction task between anchor mini-paths and query terms. To solve the node attachment task, it learns feature representations for query-anchor pairs from multiple views and performs multi-view co-training for prediction. Extensive experiments show that STEAM outperforms state-of-the-art methods for taxonomy expansion by 11.6% in accuracy and 7.0% in mean reciprocal rank on three public benchmarks.
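A hedged sketch of how such self-supervision signals can be generated: sample a root-to-leaf mini-path from the existing taxonomy, hold out one node on it as the query, and treat the held-out position as the attachment label. The data structures and function names below are illustrative, not STEAM's actual interface.

```python
import random

def sample_mini_path(parent_of, leaves, length=3):
    """Walk up from a random leaf to get a path of up to `length` nodes."""
    node = random.choice(leaves)
    path = [node]
    while len(path) < length and path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return list(reversed(path))  # top-down order

def make_attachment_example(parent_of, leaves):
    path = sample_mini_path(parent_of, leaves)
    pos = random.randrange(1, len(path))   # hold out a non-root node as the query
    query = path[pos]
    anchor = path[:pos] + path[pos + 1:]   # remaining nodes form the anchor mini-path
    return query, anchor, pos              # label: attachment position on the anchor

# Toy taxonomy: root -> {noise, pollutant} -> {traffic noise, air pollutant}.
parent_of = {"noise": "root", "pollutant": "root",
             "traffic noise": "noise", "air pollutant": "pollutant"}
leaves = ["traffic noise", "air pollutant"]
print(make_attachment_example(parent_of, leaves))
```

A classifier trained on such (query, anchor, position) triples can then score where an unseen concept term should attach, which is the node attachment prediction task the abstract describes.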


BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision

Chen Liang, Yue Yu, Haoming Jiang, Siawpeng Er, Ruijia Wang, Tuo Zhao, Chao Zhang

The two-stage BOND framework. In Stage I, the pre-trained BERT is adapted to the distantly supervised NER task with early stopping. In Stage II, a student model and a teacher model are first initialized from the model learned in Stage I. Then the student model is trained using pseudo-labels generated by the teacher model. Meanwhile, the teacher model is iteratively updated by the early-stopped student.

Paper Link

Video Link

We study the open-domain named entity recognition (NER) problem under distant supervision. Distant supervision, though it does not require large amounts of manual annotation, yields highly incomplete and noisy labels via external knowledge bases. To address this challenge, we propose a new computational framework, BOND, which leverages the power of pre-trained language models (e.g., BERT and RoBERTa) to improve the prediction performance of NER models. Specifically, we propose a two-stage training algorithm: in the first stage, we adapt the pre-trained language model to the NER task using the distant labels, which significantly improves recall and precision; in the second stage, we drop the distant labels and propose a self-training approach to further improve model performance. Thorough experiments on 5 benchmark datasets demonstrate the superiority of BOND over existing distantly supervised NER methods.
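As an illustrative sketch of the Stage II teacher-student loop (not the authors' released implementation): the teacher produces hard pseudo-labels for each token, the student fits them, and the teacher is periodically refreshed from the student, mirroring the figure above. It assumes a model that maps a batch of token inputs to per-token logits of shape (batch, tokens, classes); the helper name self_train is ours, and BOND's confidence-based token selection is omitted for brevity.

```python
import copy
import torch

def self_train(stage1_model, loader, steps=1000, refresh_every=200, lr=1e-5):
    # Both teacher and student start from the early-stopped Stage I model.
    teacher = copy.deepcopy(stage1_model).eval()
    student = copy.deepcopy(stage1_model).train()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    step = 0
    while step < steps:
        for batch in loader:
            with torch.no_grad():                    # teacher generates pseudo-labels
                pseudo = teacher(batch).argmax(dim=-1)
            logits = student(batch)                  # student fits the pseudo-labels
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, -2), pseudo.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step % refresh_every == 0:            # teacher <- current student
                teacher.load_state_dict(student.state_dict())
            if step >= steps:
                break
    return student
```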


LogPar: Logistic PARAFAC2 Factorization for Temporal Binary Data with Missing Values

Kejing Yin, Ardavan Afshar, Joyce Ho, William Cheung, Chao Zhang, Jimeng Sun

PARAFAC2 model for computational phenotyping: the input is a collection of binary matrices, each corresponding to a patient. They have the same number of columns, representing diseases, but different numbers of rows, representing clinical visits. A value of 1 in these matrices indicates a confirmed disease, while 0 means either the absence of the disease or a missing diagnosis.

Paper Link

Video Link

Binary data with one-class missing values are ubiquitous in real-world applications. They can be represented by irregular tensors with varying sizes in one dimension, where a value of one means the presence of a feature while zero means unknown (i.e., either the presence or absence of a feature). Learning accurate low-rank approximations from such binary irregular tensors is a challenging task, yet none of the existing models developed for factorizing irregular tensors take the missing values into account, and they assume Gaussian distributions, resulting in a distribution mismatch when applied to binary data. In this paper, we propose Logistic PARAFAC2 (LogPar), which models the binary irregular tensor with a Bernoulli distribution parameterized by an underlying real-valued tensor. We then approximate the underlying tensor with a positive-unlabeled learning loss function to account for the missing values, and incorporate uniqueness and temporal smoothness regularization to enhance interpretability. Extensive experiments using large-scale real-world datasets show that LogPar outperforms all baselines in both irregular tensor completion and downstream predictive tasks. On irregular tensor completion, LogPar achieves up to 26% relative improvement over the best baseline; it also obtains average relative improvements of 13.2% for heart failure prediction and 14% for mortality prediction compared to state-of-the-art PARAFAC2 models.
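A minimal sketch of the elementwise loss idea, under stated assumptions: the binary tensor is modeled as Bernoulli with logits from a low-rank factorization, and zero ("unlabeled") entries are down-weighted relative to confirmed ones. The fixed down-weighting below is an illustrative simplification of positive-unlabeled learning, not the paper's exact PU risk estimator; all names are hypothetical.

```python
import torch

def pu_logistic_loss(x, logits, unlabeled_weight=0.1):
    """x: binary tensor (1 = confirmed, 0 = unknown); logits: real-valued."""
    # Bernoulli negative log-likelihood per entry (the "logistic" part).
    nll = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, x, reduction="none")
    # Trust the positives fully; down-weight the ambiguous zeros,
    # since a zero may be a missing diagnosis rather than a true absence.
    w = torch.where(x > 0, torch.ones_like(x),
                    unlabeled_weight * torch.ones_like(x))
    return (w * nll).mean()

# Toy usage: rank-2 logits for a single 4x3 patient slice (visits x diseases).
U = torch.randn(4, 2, requires_grad=True)
V = torch.randn(3, 2, requires_grad=True)
x = torch.tensor([[1., 0., 0.], [0., 1., 0.], [1., 1., 0.], [0., 0., 0.]])
loss = pu_logistic_loss(x, U @ V.T)
loss.backward()
print(float(loss))
```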


HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units

Shenda Hong, Yanbo Xu, Alind Khare, Satria Priambada, Kevin Maher, Alaa Aljiffry, Jimeng Sun, Alexey Tumanov

A diagram of the HOLMES system architecture: model zoo, ensemble composer, and real-time serving system.

Paper Link

Video Link

Deep learning models have achieved expert-level performance in healthcare, but with an almost exclusive focus on training accurate models. However, in many clinical environments such as the intensive care unit (ICU), real-time model serving is equally if not more important than accuracy, because ICU patient care is simultaneously more urgent and more expensive. Clinical decisions and their timeliness therefore directly affect both the patient outcome and the cost of care. To make timely decisions, we argue the underlying serving system must be latency-aware. Compounding the challenge, health analytic applications often require a combination of models instead of a single model, to better specialize individual models for different targets, multi-modal data, different prediction windows, and potentially personalized predictions. To address these challenges, we propose HOLMES, an online model ensemble serving framework for healthcare applications. HOLMES dynamically identifies the best-performing set of models to ensemble for the highest accuracy, while also satisfying sub-second latency constraints on end-to-end prediction. HOLMES is tested on a risk prediction task with pediatric cardiac ICU data, achieving above 95% prediction accuracy and sub-second latency in a 64-bed simulation.
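To make the ensemble-composition idea concrete, here is a hedged sketch of latency-aware model selection: greedily add the model with the best estimated accuracy gain per unit of latency while the end-to-end budget holds. The model profiles are made up, latencies are treated as additive for simplicity (a parallel-execution system would budget differently), and HOLMES performs this selection dynamically at serving time rather than once offline.

```python
def compose_ensemble(models, latency_budget_ms):
    """models: list of dicts with 'name', 'acc_gain', 'latency_ms'."""
    chosen, used = [], 0.0
    # Consider models with the best marginal accuracy gain per ms first.
    for m in sorted(models, key=lambda m: m["acc_gain"] / m["latency_ms"],
                    reverse=True):
        if used + m["latency_ms"] <= latency_budget_ms:
            chosen.append(m["name"])
            used += m["latency_ms"]
    return chosen, used

# Hypothetical model zoo entries with profiled accuracy gains and latencies.
zoo = [{"name": "lstm_ecg",   "acc_gain": 0.04, "latency_ms": 120},
       {"name": "cnn_vitals", "acc_gain": 0.03, "latency_ms": 40},
       {"name": "gru_labs",   "acc_gain": 0.02, "latency_ms": 300}]
print(compose_ensemble(zoo, latency_budget_ms=500))  # sub-second budget
```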