Steve Mussmann
Assistant Professor in Georgia Tech's School of Computer Science, starting Fall 2024.
Research interests include active labeling/learning, data selection, and data-centric ML.
Research
Machine learning is being incorporated into a quickly increasing variety and number of systems and processes in society. My research is driven by the goal of making ML easier to use, more effective, and more likely to be used in beneficial ways. It often takes the form of abstracting machine learning issues (data efficiency, interpretability, robustness, etc.) away from specific application areas (computer vision, NLP, computational biology, etc.) to discover insights that lead to more useful algorithms and more reliable best practices. By using a mix of theoretical and experimental techniques, my research takes a broad perspective while ensuring practical relevance.
Research on learning algorithms has seen remarkable progress over the past decade, especially with regard to text and images, which has ignited interest in machine learning. While the learning algorithm is critical to an ML system, many other aspects are under-studied, including data sourcing, pre-processing, annotation, cleaning, validation, and monitoring, all of which significantly affect the reliability and usability of the system. My work often falls under the umbrella of data-centric machine learning, where the focus is on improving the quality of the data while the model architecture and optimization algorithm are held fixed.
Much of my previous work falls into one of two categories:
Active Labeling/Learning: human supervision and interaction with nature (experiments) can be expensive and slow. For use cases where collecting labels is costly, can we design efficient algorithms that iteratively choose which data to label, significantly decreasing the cost and effort of labeling?
Data Selection: given increasingly large and noisy datasets, training on all available data can be expensive and can yield sub-optimal performance for specific tasks. Can we efficiently select training data that yield more accurate predictors?
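As a minimal illustration of the active-labeling loop described above, here is a sketch of pool-based uncertainty sampling on synthetic data. The model, data, and function names are illustrative choices, not taken from any particular paper or course material:

```python
import numpy as np

def uncertainty_sampling(probs):
    """Return the index of the pool point whose prediction is least confident."""
    return int(np.argmin(np.abs(probs - 0.5)))

def fit_logreg(X, y, lr=0.5, steps=500):
    """Tiny logistic regression trained with gradient descent (bias included)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1 / (1 + np.exp(-Xb @ w))

rng = np.random.default_rng(0)
# Synthetic pool: two Gaussian blobs; labels stay hidden until "queried".
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_hidden = np.array([0] * 50 + [1] * 50).astype(float)

labeled = [0, 50]                     # seed set: one point per class
pool = [i for i in range(100) if i not in labeled]
for _ in range(5):                    # actively query 5 more labels
    w = fit_logreg(X[labeled], y_hidden[labeled])
    probs = predict_proba(w, X[pool])
    pick = pool.pop(uncertainty_sampling(probs))
    labeled.append(pick)              # we "pay" for this label only

w = fit_logreg(X[labeled], y_hidden[labeled])
acc = np.mean((predict_proba(w, X) > 0.5) == (y_hidden > 0.5))
print(f"accuracy after {len(labeled)} labels: {acc:.2f}")
```

On this easy synthetic task a handful of carefully chosen labels already recovers a good classifier; the point of the research is to make such query strategies efficient and reliable in far harder settings.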
Teaching
CS 8803 Fall 2024, Data-Centric Machine Learning
In Fall 2024, I am teaching a special topics course, CS 8803, titled "Data-Centric Machine Learning". This course focuses on reading, reviewing, and discussing research papers and on a semester-long team research project. Students are expected to have a strong grasp of machine learning concepts, a solid background in probability and linear algebra, and the ability to implement algorithms and run experiments. This course should not be your first course in machine learning; rather, it builds knowledge in the sub-area of data-centric ML on top of a strong ML foundation.
Tentative schedule
Week 1: course logistics, data-centric ML overview; students provide preferences for the papers they will present and the discussions they will lead.
Week 2: real-world challenges related to the output (label) data distribution
Ambiguous label definitions and many label sources lead to noisy labels
Limited labeled data requiring the use of other data or strategies
Various types of supervision beyond the desired output of a system
Multiple “correct” system outputs that cannot be easily represented by an annotation (e.g. NLP evaluation)
Week 3: real-world challenges related to (input) data distribution
“Spurious correlations” enable good performance on i.i.d. data but generalize very poorly under seemingly slight distribution shifts
Good performance “on average” but weak performance on important but small subpopulations
Dataset bias for pre-training datasets
Monitoring for data drift
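As a concrete (and deliberately simple) illustration of the drift-monitoring topic above, here is a sketch that flags input drift by comparing per-feature means of a live window against a reference sample. The statistic and thresholds are illustrative assumptions, not a recommended production monitor:

```python
import numpy as np

def drift_score(ref, live):
    """Standardized mean difference per feature; large values suggest drift."""
    mu_r, sd_r = ref.mean(axis=0), ref.std(axis=0) + 1e-12
    return np.abs(live.mean(axis=0) - mu_r) / sd_r

rng = np.random.default_rng(2)
ref = rng.normal(0, 1, (1000, 3))           # training-time reference sample
live_ok = rng.normal(0, 1, (200, 3))        # live window with no drift
live_bad = live_ok + np.array([0, 2.0, 0])  # feature 1 shifted by 2 sigma

print("no drift:", drift_score(ref, live_ok).round(2))
print("drifted :", drift_score(ref, live_bad).round(2))
```

Real monitors typically use proper two-sample tests and track label or prediction distributions as well, but the core idea of comparing live data against a fixed reference is the same.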
Weeks 4 & 5: active labeling/learning
Choosing points based on uncertainty, diversity, and/or representativeness
Experimental design for gathering information
Human-in-the-loop approaches (abstentions, continual improvement)
Weeks 6 & 7: data selection/curation
Curating web-sourced data (LLMs, etc.)
Outliers and misleading data
Batch/subset selection, curriculum learning, continual learning, etc.
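One simple family of data-selection heuristics relevant to the weeks above scores each training example by its loss under a reference model and drops the highest-loss examples, on the assumption that these are disproportionately mislabeled. The sketch below is an illustrative toy, not a specific method from the syllabus:

```python
import numpy as np

def fit(Xb, y, lr=0.5, steps=500):
    """Logistic regression via gradient descent (bias column already appended)."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def per_example_loss(w, Xb, y):
    """Logistic loss for each individual example under model w."""
    p = 1 / (1 + np.exp(-Xb @ w))
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
Xb = np.hstack([X, np.ones((200, 1))])
y = np.array([0] * 100 + [1] * 100).astype(float)
flip = rng.choice(200, 30, replace=False)   # inject 15% label noise
y_noisy = y.copy()
y_noisy[flip] = 1 - y_noisy[flip]

def clean_acc(w):
    """Accuracy measured against the uncorrupted labels."""
    return np.mean((1 / (1 + np.exp(-Xb @ w)) > 0.5) == (y > 0.5))

w0 = fit(Xb, y_noisy)                        # reference model on all noisy data
losses = per_example_loss(w0, Xb, y_noisy)
keep = np.argsort(losses)[:160]              # drop the 40 highest-loss examples
w1 = fit(Xb[keep], y_noisy[keep])

print(f"all data: {clean_acc(w0):.2f}, filtered: {clean_acc(w1):.2f}")
```

On this toy problem the flipped labels dominate the high-loss tail, so filtering removes most of the noise; in realistic settings high loss can also mark rare but valuable examples, which is exactly the tension that makes data selection a research problem.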
Weeks 8 & 9: supervision beyond gold outputs
Self-supervision
Semi-supervised learning (low density on decision boundary)
Data programming
Multiple-instance learning (learning from aggregated labels)
Positive-unlabeled (PU) or censored learning
Balancing various fidelities
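A common entry point to the semi-supervised topics above is self-training (pseudo-labeling): fit a model on the few available labels, then add its own confident predictions on unlabeled data as extra training labels. The sketch below is a minimal illustration on synthetic data; the confidence threshold and model are my own assumptions:

```python
import numpy as np

def fit_logreg(Xb, y, lr=0.5, steps=500):
    """Logistic regression via gradient descent (bias column already appended)."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
Xb = np.hstack([X, np.ones((200, 1))])
y_true = np.array([0] * 100 + [1] * 100).astype(float)

labeled = np.array([0, 100])                   # only two labeled examples
unlabeled = np.setdiff1d(np.arange(200), labeled)

w = fit_logreg(Xb[labeled], y_true[labeled])   # round 1: labeled data only
p = 1 / (1 + np.exp(-Xb[unlabeled] @ w))
confident = np.abs(p - 0.5) > 0.45             # keep only confident predictions
pseudo_idx = unlabeled[confident]
pseudo_y = (p[confident] > 0.5).astype(float)  # model's own guesses as labels

X_aug = np.vstack([Xb[labeled], Xb[pseudo_idx]])
y_aug = np.concatenate([y_true[labeled], pseudo_y])
w2 = fit_logreg(X_aug, y_aug)                  # round 2: labels + pseudo-labels

acc = np.mean((1 / (1 + np.exp(-Xb @ w2)) > 0.5) == (y_true > 0.5))
print(f"accuracy with 2 true labels + {len(pseudo_idx)} pseudo-labels: {acc:.2f}")
```

This works here because the classes form well-separated clusters, which is the low-density-separation assumption mentioned above; when that assumption fails, confident pseudo-labels can reinforce the model's own mistakes.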
Week 10: mid-semester research project presentations
Week 11: evaluation challenges
Data leakage and dependencies
Multiple “correct” outputs
Monitoring
Week 12: dataset expansion and compression
Data augmentations
Model/dataset distillation
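For the dataset-expansion side of the week above, the simplest form of data augmentation creates perturbed copies of each input while reusing its label. The sketch below uses Gaussian jitter on feature vectors; note that label invariance under the perturbation is a modeling assumption, not a given:

```python
import numpy as np

def augment(X, y, n_copies=3, noise=0.1, rng=None):
    """Expand a dataset with jittered copies of each input.

    Assumes the label is invariant to small input perturbations,
    which must be justified for the task at hand.
    """
    rng = rng or np.random.default_rng(0)
    X_aug = [X] + [X + rng.normal(0, noise, X.shape) for _ in range(n_copies)]
    y_aug = [y] * (n_copies + 1)            # labels are simply repeated
    return np.vstack(X_aug), np.concatenate(y_aug)

X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1])
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)   # (8, 2) (8,)
```

Image-domain analogues (flips, crops, color jitter) follow the same pattern: the transformation encodes an invariance the modeler believes the true labeling function has.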
Week 13: data governance and economies
Attributing predictions to training data
Data valuation
Data privacy
Weeks 14 & 15: final research project presentations