Computational Math and Statistics Seminar by Jiangrui Kang: Representative and Diverse Subdata Selection for Semi-supervised Learning and Beyond
Speaker: Jiangrui Kang, Ph.D. candidate, Illinois Institute of Technology
Title: Representative and Diverse Subdata Selection for Semi-supervised Learning and Beyond
Abstract: Semi-supervised learning (SSL) is a popular paradigm that uses both labeled and unlabeled data to improve training performance. When the labeled data are not given in advance, the choice of which samples to label also significantly affects performance, particularly when the labeling budget is extremely low. However, previous studies of this problem either lack theoretical guarantees or are impractical. To fill this gap, we propose a Representative and Diverse Subdata Selection approach (RDSS), which minimizes a novel criterion, α-Maximum Mean Discrepancy (α-MMD), that evaluates both the representativeness and the diversity of the subdata. We prove the generalization ability of this approach for low-budget learning and a finite-sample error bound for the optimization algorithm. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL). If time permits, potential future applications of α-MMD to other SSL and Bayesian optimization problems will be presented.
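For readers unfamiliar with discrepancy-based selection, the following is a minimal sketch of the general idea only. It is not the speaker's RDSS algorithm: the abstract does not define the modified criterion, so the sketch assumes the standard squared Maximum Mean Discrepancy with a Gaussian kernel and a simple greedy search. The function names, the kernel choice, and the bandwidth parameter are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(data, subset_idx, bandwidth=1.0):
    """Biased estimate of the squared MMD between a subset and the full data."""
    k = rbf_kernel(data, data, bandwidth)
    idx = np.asarray(subset_idx)
    within_subset = k[np.ix_(idx, idx)].mean()
    subset_vs_full = k[idx].mean()
    within_full = k.mean()
    return within_subset - 2.0 * subset_vs_full + within_full


def greedy_select(data, budget, bandwidth=1.0):
    """Greedily pick `budget` points, each time adding the point that gives
    the smallest squared MMD between the selected set and the full data.
    (Standard MMD and a greedy heuristic are assumptions of this sketch.)"""
    k = rbf_kernel(data, data, bandwidth)
    mean_sim = k.mean(axis=1)        # average similarity of each point to the full data
    diag = np.diag(k)
    sum_to_selected = np.zeros(len(data))  # sum of k(x, s) over selected s
    within_sum = 0.0                 # sum of k(s, s') over selected pairs
    cross_sum = 0.0                  # sum of mean_sim over selected points
    selected = []
    for _ in range(budget):
        m = len(selected) + 1
        # Squared MMD of (selected + candidate), dropping the constant k.mean().
        scores = (within_sum + 2.0 * sum_to_selected + diag) / m**2 \
            - 2.0 * (cross_sum + mean_sim) / m
        scores[selected] = np.inf    # never pick the same point twice
        best = int(np.argmin(scores))
        within_sum += 2.0 * sum_to_selected[best] + diag[best]
        cross_sum += mean_sim[best]
        sum_to_selected += k[:, best]
        selected.append(best)
    return selected
```

The first term of the score penalizes selected points that are similar to one another (diversity), while the second rewards points that are similar to the data as a whole (representativeness). On data with two well-separated clusters and a budget of two, this sketch picks one point from each cluster.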
Computational Mathematics and Statistics Seminar