Steve Mussmann
Assistant Professor in Georgia Tech's School of Computer Science, starting Fall 2024.
Research interests include active labeling/learning, data selection, and data-centric ML.
Research
Machine learning is being incorporated into a quickly increasing variety and number of systems and processes in society. My research is driven by the goal of making ML easier to use, more effective, and more likely to be used in beneficial ways. It often takes the form of abstracting machine learning issues (data efficiency, interpretability, robustness, etc.) away from specific application areas (computer vision, NLP, computational biology, etc.) to discover insights that lead to more useful algorithms and more reliable best practices. By using a mix of theoretical and experimental techniques, my research takes a broad perspective while ensuring practical relevance.
Research on learning algorithms has seen remarkable progress over the past decade, especially with regard to text and images, which has ignited interest in machine learning. While the learning algorithm is critical to an ML system, many other aspects are under-studied, including data sourcing, pre-processing, annotation, cleaning, validation, and monitoring, all of which significantly affect the reliability and usability of the system. My work often falls under the umbrella of data-centric machine learning, where the focus is on improving the quality of the data while the model architecture and optimization algorithm are held fixed.
Much of my previous work falls into one of two categories:
Active Labeling/Learning: human supervision and interaction with nature (experiments) can be expensive and slow. For use cases where collecting labels is costly, can we design efficient algorithms that iteratively choose which data to label, significantly decreasing the cost and effort of labeling?
Data Selection: given increasingly large and noisy datasets, training on all available data can be expensive and can yield sub-optimal performance for specific tasks. Can we efficiently select training data that yield more accurate predictors?
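As a minimal illustration of the active-labeling loop described above, here is a sketch of pool-based uncertainty sampling on synthetic data. The model, data, and function names are illustrative choices, not taken from any particular paper or course material:

```python
import numpy as np

def uncertainty_sampling(probs):
    """Return the index of the pool point whose prediction is least confident."""
    return int(np.argmin(np.abs(probs - 0.5)))

def fit_logreg(X, y, lr=0.5, steps=500):
    """Tiny logistic regression trained with gradient descent (bias included)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1 / (1 + np.exp(-Xb @ w))

rng = np.random.default_rng(0)
# Synthetic pool: two Gaussian blobs; labels stay hidden until "queried".
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_hidden = np.array([0] * 50 + [1] * 50).astype(float)

labeled = [0, 50]                     # seed set: one point per class
pool = [i for i in range(100) if i not in labeled]
for _ in range(5):                    # actively query 5 more labels
    w = fit_logreg(X[labeled], y_hidden[labeled])
    probs = predict_proba(w, X[pool])
    pick = pool.pop(uncertainty_sampling(probs))
    labeled.append(pick)              # we "pay" for this label only

w = fit_logreg(X[labeled], y_hidden[labeled])
acc = np.mean((predict_proba(w, X) > 0.5) == (y_hidden > 0.5))
print(f"accuracy after {len(labeled)} labels: {acc:.2f}")
```

On this easy synthetic task a handful of carefully chosen labels already recovers a good classifier; the point of the research is to make such query strategies efficient and reliable in far harder settings.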
Teaching
CS 8803 Fall 2024, Data-Centric Machine Learning
In Fall 2024, I am teaching a special topics course, CS 8803, titled "Data-Centric Machine Learning". This course focuses on reading, reviewing, and discussing research papers and on a semester-long team research project. Students are expected to have a strong grasp of machine learning concepts, a solid background in probability and linear algebra, and the ability to implement algorithms and run experiments. This course should not be your first course in machine learning; rather, it builds knowledge in the sub-area of data-centric ML on top of a strong ML foundation.
Tentative schedule
Week 1: course logistics, data-centric ML overview; students provide preferences for the papers they will present and the discussions they will lead.
Week 2: real-world challenges related to the output (label) data distribution
Ambiguous label definitions and many label sources lead to noisy labels
Limited labeled data requiring the use of other data or strategies
Various types of supervision beyond the desired output of a system
Multiple “correct” system outputs that cannot be easily represented by an annotation (e.g. NLP evaluation)
Week 3: real-world challenges related to (input) data distribution
“Spurious correlations” enable good performance on i.i.d. data but generalize very poorly under seemingly slight distribution shifts
Good performance “on average” but weak performance on important but small subpopulations
Dataset bias for pre-training datasets
Monitoring for data drift
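As a concrete (and deliberately simple) illustration of the drift-monitoring topic above, here is a sketch that flags input drift by comparing per-feature means of a live window against a reference sample. The statistic and thresholds are illustrative assumptions, not a recommended production monitor:

```python
import numpy as np

def drift_score(ref, live):
    """Standardized mean difference per feature; large values suggest drift."""
    mu_r, sd_r = ref.mean(axis=0), ref.std(axis=0) + 1e-12
    return np.abs(live.mean(axis=0) - mu_r) / sd_r

rng = np.random.default_rng(2)
ref = rng.normal(0, 1, (1000, 3))           # training-time reference sample
live_ok = rng.normal(0, 1, (200, 3))        # live window with no drift
live_bad = live_ok + np.array([0, 2.0, 0])  # feature 1 shifted by 2 sigma

print("no drift:", drift_score(ref, live_ok).round(2))
print("drifted :", drift_score(ref, live_bad).round(2))
```

Real monitors typically use proper two-sample tests and track label or prediction distributions as well, but the core idea of comparing live data against a fixed reference is the same.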
Weeks 4 & 5: active labeling/learning
Choosing points based on uncertainty, diversity, and/or representativeness
Experimental design for gathering information
Human-in-the-loop approaches (abstentions, continual improvement)
Weeks 6 & 7: data selection/curation
Curating web-sourced data (LLMs, etc.)
Outliers and misleading data
Batch/subset selection, curriculum learning, continual learning, etc.
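One simple family of data-selection heuristics relevant to the weeks above scores each training example by its loss under a reference model and drops the highest-loss examples, on the assumption that these are disproportionately mislabeled. The sketch below is an illustrative toy, not a specific method from the syllabus:

```python
import numpy as np

def fit(Xb, y, lr=0.5, steps=500):
    """Logistic regression via gradient descent (bias column already appended)."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def per_example_loss(w, Xb, y):
    """Logistic loss for each individual example under model w."""
    p = 1 / (1 + np.exp(-Xb @ w))
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
Xb = np.hstack([X, np.ones((200, 1))])
y = np.array([0] * 100 + [1] * 100).astype(float)
flip = rng.choice(200, 30, replace=False)   # inject 15% label noise
y_noisy = y.copy()
y_noisy[flip] = 1 - y_noisy[flip]

def clean_acc(w):
    """Accuracy measured against the uncorrupted labels."""
    return np.mean((1 / (1 + np.exp(-Xb @ w)) > 0.5) == (y > 0.5))

w0 = fit(Xb, y_noisy)                        # reference model on all noisy data
losses = per_example_loss(w0, Xb, y_noisy)
keep = np.argsort(losses)[:160]              # drop the 40 highest-loss examples
w1 = fit(Xb[keep], y_noisy[keep])

print(f"all data: {clean_acc(w0):.2f}, filtered: {clean_acc(w1):.2f}")
```

On this toy problem the flipped labels dominate the high-loss tail, so filtering removes most of the noise; in realistic settings high loss can also mark rare but valuable examples, which is exactly the tension that makes data selection a research problem.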
Weeks 8 & 9: supervision beyond gold outputs
Self-supervision
Semi-supervised learning (low density on decision boundary)
Data programming
Multiple-instance learning (learning from aggregated labels)
Positive-unlabeled (PU) or censored learning
Balancing various fidelities
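A common entry point to the semi-supervised topics above is self-training (pseudo-labeling): fit a model on the few available labels, then add its own confident predictions on unlabeled data as extra training labels. The sketch below is a minimal illustration on synthetic data; the confidence threshold and model are my own assumptions:

```python
import numpy as np

def fit_logreg(Xb, y, lr=0.5, steps=500):
    """Logistic regression via gradient descent (bias column already appended)."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
Xb = np.hstack([X, np.ones((200, 1))])
y_true = np.array([0] * 100 + [1] * 100).astype(float)

labeled = np.array([0, 100])                   # only two labeled examples
unlabeled = np.setdiff1d(np.arange(200), labeled)

w = fit_logreg(Xb[labeled], y_true[labeled])   # round 1: labeled data only
p = 1 / (1 + np.exp(-Xb[unlabeled] @ w))
confident = np.abs(p - 0.5) > 0.45             # keep only confident predictions
pseudo_idx = unlabeled[confident]
pseudo_y = (p[confident] > 0.5).astype(float)  # model's own guesses as labels

X_aug = np.vstack([Xb[labeled], Xb[pseudo_idx]])
y_aug = np.concatenate([y_true[labeled], pseudo_y])
w2 = fit_logreg(X_aug, y_aug)                  # round 2: labels + pseudo-labels

acc = np.mean((1 / (1 + np.exp(-Xb @ w2)) > 0.5) == (y_true > 0.5))
print(f"accuracy with 2 true labels + {len(pseudo_idx)} pseudo-labels: {acc:.2f}")
```

This works here because the classes form well-separated clusters, which is the low-density-separation assumption mentioned above; when that assumption fails, confident pseudo-labels can reinforce the model's own mistakes.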
Week 10: mid-semester research project presentations
Week 11: evaluation challenges
Data leakage and dependencies
Multiple “correct” outputs
Monitoring
Week 12: dataset expansion and compression
Data augmentations
Model/dataset distillation
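For the dataset-expansion side of the week above, the simplest form of data augmentation creates perturbed copies of each input while reusing its label. The sketch below uses Gaussian jitter on feature vectors; note that label invariance under the perturbation is a modeling assumption, not a given:

```python
import numpy as np

def augment(X, y, n_copies=3, noise=0.1, rng=None):
    """Expand a dataset with jittered copies of each input.

    Assumes the label is invariant to small input perturbations,
    which must be justified for the task at hand.
    """
    rng = rng or np.random.default_rng(0)
    X_aug = [X] + [X + rng.normal(0, noise, X.shape) for _ in range(n_copies)]
    y_aug = [y] * (n_copies + 1)            # labels are simply repeated
    return np.vstack(X_aug), np.concatenate(y_aug)

X = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1])
X_big, y_big = augment(X, y)
print(X_big.shape, y_big.shape)   # (8, 2) (8,)
```

Image-domain analogues (flips, crops, color jitter) follow the same pattern: the transformation encodes an invariance the modeler believes the true labeling function has.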
Week 13: data governance and economies
Attributing predictions to training data
Data valuation
Data privacy
Weeks 14 & 15: final research project presentations