Christos Faloutsos

Carnegie Mellon University

H-index: 151

North America-United States

Description

Christos Faloutsos is a distinguished researcher at Carnegie Mellon University, with an h-index of 151 overall and 79 since 2020. He specializes in Data Mining, Graph Mining, and Databases.

His recent articles reflect a diverse array of research interests and contributions to the field:

Large Language Models on Tabular Data--A Survey

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition

DataLore: Can a large language model find all lost scrolls in a data repository?

4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs

OpenTab: Advancing Large Language Models as Open-domain Table Reasoners

EBV: Electronic Bee-Veterinarian for Principled Mining and Forecasting of Honeybee Time Series

McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets

DiffFind: Discovering Differential Equations from Time Series

Professor Information

University	Carnegie Mellon University
Position	___
Citations(all)	114733
Citations(since 2020)	29768
Cited By	99586
hIndex(all)	151
hIndex(since 2020)	79
i10Index(all)	591
i10Index(since 2020)	359
University Profile Page	Carnegie Mellon University
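
The hIndex and i10Index rows above follow the standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. A minimal sketch of both computations, using hypothetical citation counts that are not taken from this profile:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts for six papers (illustration only).
example = [25, 12, 10, 4, 3, 0]
print(h_index(example))    # 4: four papers have at least 4 citations each
print(i10_index(example))  # 3: three papers have at least 10 citations
```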

Research & Interests List

Data Mining

Graph Mining

Databases

Top articles of Christos Faloutsos

Large Language Models on Tabular Data--A Survey

Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently no comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides references to relevant code and datasets. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Authors

Xi Fang,Weijie Xu,Fiona Anting Tan,Jiani Zhang,Ziqing Hu,Yanjun Qi,Scott Nickleach,Diego Socolinsky,Srinivasan Sengamedu,Christos Faloutsos

Published Date

2024/2/27

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition

Open-domain real-world entity recognition is essential yet challenging, involving identifying various entities in diverse environments. The lack of a suitable evaluation dataset has been a major obstacle in this field due to the vast number of entities and the extensive human effort required for data curation. We introduce Entity6K, a comprehensive dataset for real-world entity recognition, featuring 5,700 entities across 26 categories, each supported by 5 human-verified images with annotations. Entity6K offers a diverse range of entity names and categorizations, addressing a gap in existing datasets. We conducted benchmarks with existing models on tasks like image captioning, object detection, zero-shot classification, and dense captioning to demonstrate Entity6K's effectiveness in evaluating models' entity recognition capabilities. We believe Entity6K will be a valuable resource for advancing accurate entity recognition in open-domain settings.

Authors

Jielin Qiu,William Han,Winfred Wang,Zhengyuan Yang,Linjie Li,Jianfeng Wang,Christos Faloutsos,Lei Li,Lijuan Wang

Journal

arXiv preprint arXiv:2403.12339

Published Date

2024/3/19

DataLore: Can a large language model find all lost scrolls in a data repository?

How can we effectively generate missing data transformations among tables in a data repository? Multiple versions of the same tables are generated from the iterative process when data scientists and machine learning engineers fine-tune their ML pipelines, making incremental improvements. This process often involves data transformation and augmentation that produces an augmented table based on its base version and related tables. However, data transformations are often not well-documented or completely missing, resulting in poor traceability, reproducibility and explainability of ML pipelines. In this paper, we propose DATALORE, a framework that explains data changes between an initial dataset and its augmented version to improve traceability. Given a base table, DATALORE first discovers its potentially related tables from the data repository using a variety of data discovery techniques. DATALORE then effectively leverages a large language model (LLM) to generate a variety of data transformations that lead to the augmented table. DATALORE validates these transformations and selects the minimum number of related tables to ensure traceability and reproducibility of the ML pipelines. A preliminary experiment shows that DATALORE is able to effectively recover data transformations on two benchmark datasets.

Authors

Yuze Lou,Chuan Lei,Xiao Qin,Zichen Wang,Christos Faloutsos,Rishita Anubhai,Huzefa Rangwala

Published Date

2024

4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs

Although RDBs store vast amounts of rich, informative data spread across interconnected tables, the progress of predictive machine learning models as applied to such tasks arguably falls well behind advances in other domains such as computer vision or natural language processing. This deficit stems, at least in part, from the lack of established/public RDB benchmarks as needed for training and evaluation purposes. As a result, related model development thus far often defaults to tabular approaches trained on ubiquitous single-table benchmarks, or on the relational side, graph-based alternatives such as GNNs applied to a completely different set of graph datasets devoid of tabular characteristics. To more precisely target RDBs lying at the nexus of these two complementary regimes, we explore a broad class of baseline models predicated on: (i) converting multi-table datasets into graphs using various strategies equipped with efficient subsampling, while preserving tabular characteristics; and (ii) trainable models with well-matched inductive biases that output predictions based on these input subgraphs. Then, to address the dearth of suitable public benchmarks and reduce siloed comparisons, we assemble a diverse collection of (i) large-scale RDB datasets and (ii) coincident predictive tasks. From a delivery standpoint, we operationalize the above four dimensions (4D) of exploration within a unified, scalable open-source toolbox called 4DBInfer. We conclude by presenting evaluations using 4DBInfer, the results of which highlight the importance of considering each such dimension in the design of RDB predictive models, as well as the …

Authors

Minjie Wang,Quan Gan,David Wipf,Zhenkun Cai,Ning Li,Jianheng Tang,Yanlin Zhang,Zizhao Zhang,Zunyao Mao,Yakun Song,Yanbo Wang,Jiahang Li,Han Zhang,Guang Yang,Xiao Qin,Chuan Lei,Muhan Zhang,Weinan Zhang,Christos Faloutsos,Zheng Zhang

Journal

arXiv preprint arXiv:2404.18209

Published Date

2024/4/28

OpenTab: Advancing Large Language Models as Open-domain Table Reasoners

Large Language Models (LLMs) trained on large volumes of data excel at various natural language tasks, but they cannot handle tasks requiring knowledge they have not been trained on. One solution is to use a retriever that fetches relevant information to expand the LLM's knowledge scope. However, existing text-oriented retrieval-based LLMs are not ideal for structured table data due to diversified data modalities and large table sizes. In this work, we propose OpenTab, an open-domain table reasoning framework powered by LLMs. Overall, OpenTab leverages a table retriever to fetch relevant tables and then generates SQL programs to parse the retrieved tables efficiently. Utilizing the intermediate data derived from the SQL executions, it conducts grounded inference to produce an accurate response. Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and closed-domain settings, achieving up to 21.5% higher accuracy. We further run ablation studies to validate the efficacy of our proposed designs of the system.

Authors

Kezhi Kong,Jiani Zhang,Zhengyuan Shen,Balasubramaniam Srinivasan,Chuan Lei,Christos Faloutsos,Huzefa Rangwala,George Karypis

Journal

arXiv preprint arXiv:2402.14361

Published Date

2024/2/22
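
The OpenTab abstract above describes a pipeline of three steps: retrieve a relevant table, generate SQL, and reason over the SQL result. The sketch below only illustrates that shape and is not the authors' implementation. The table corpus, the word-overlap retriever, and the hard-coded `generate_sql` stand-in for the LLM are all hypothetical.

```python
import sqlite3

# Hypothetical toy corpus: table name -> (column names, rows). Values are made up.
TABLES = {
    "cities": (("name", "country", "population"),
               [("Alpha", "X", 500), ("Beta", "Y", 900)]),
    "rivers": (("name", "length_km"),
               [("Gamma", 120), ("Delta", 80)]),
}


def retrieve_table(question):
    """Toy retriever: pick the table whose name and columns overlap most with the question."""
    words = set(question.lower().replace("?", "").split())

    def overlap(item):
        name, (columns, _rows) = item
        return len(words & ({name} | set(columns)))

    return max(TABLES.items(), key=overlap)[0]


def generate_sql(question, table):
    """Stand-in for the LLM step (assumption): returns a fixed query for this demo."""
    return f"SELECT name FROM {table} ORDER BY population DESC LIMIT 1"


def answer(question):
    """Retrieve a table, load it into in-memory SQLite, and run the generated SQL."""
    table = retrieve_table(question)
    columns, rows = TABLES[table]
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    result = conn.execute(generate_sql(question, table)).fetchall()
    conn.close()
    return table, result


print(answer("Which city has the largest population?"))
```

Executing SQL over the retrieved table keeps the final answer grounded in a small intermediate result rather than the full table, which is the motivation the abstract gives for this design.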

EBV: Electronic Bee-Veterinarian for Principled Mining and Forecasting of Honeybee Time Series

Honeybees are vital for pollination and food production. Among many factors, extreme temperature (e.g., due to climate change) is particularly dangerous for bee health. Anticipating such extremes would allow beekeepers to take early preventive action. Thus, given sensor (temperature) time series data from beehives, how can we find patterns and do forecasting? Forecasting is crucial as it helps spot unexpected behavior and thus issue warnings to the beekeepers. In that case, what are the right models for forecasting? ARIMA, RNNs, or something else? We propose the EBV (Electronic Bee-Veterinarian) method, which has the following desirable properties: (i) principled: it is based on a) diffusion equations from physics and b) control theory for feedback-loop controllers; (ii) effective: it works well on multiple, real-world time sequences; (iii) explainable: it needs only a handful of parameters (e.g., bee strength) that …

Authors

Mst Shamima Hossain,Christos Faloutsos,Boris Baer,Hyoseung Kim,Vassilis J Tsotras

Published Date

2024

McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets

How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How can we obtain principled scores in a scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above; and, it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-element microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.

Authors

Braulio V Sánchez Vinces,Robson LF Cordeiro,Christos Faloutsos

Journal

arXiv preprint arXiv:2403.08027

Published Date

2024/3/12
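
The McCatch abstract above names 1NN Distance as one axis of its 'Oracle' plot. The sketch below computes only that quantity, by brute force, for any items with a user-supplied distance function; it is not the McCatch algorithm. The other axis, Group 1NN Distance, is not defined in the abstract and is left out. The function name `one_nn_distances` and the sample data are hypothetical.

```python
def one_nn_distances(items, dist):
    """Distance from each item to its nearest other item (brute force, O(n^2))."""
    result = []
    for i, a in enumerate(items):
        nearest = min(dist(a, b) for j, b in enumerate(items) if j != i)
        result.append(nearest)
    return result


# Hypothetical 1-D data: a dense group, a tight pair, and one isolated point.
points = [1.0, 1.1, 1.2, 5.0, 5.05, 20.0]
scores = one_nn_distances(points, lambda a, b: abs(a - b))

# The isolated point has by far the largest 1NN distance.
most_isolated = points[scores.index(max(scores))]
print(most_isolated)
```

Because only a distance function is required, the same call works for data without coordinates, such as strings with an edit distance. Note that the tight pair (5.0, 5.05) receives a small score here, so plain 1NN distance alone does not flag nonsingleton microclusters.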

DiffFind: Discovering Differential Equations from Time Series

Given one or more time sequences, how can we extract their governing equations? Single and co-evolving time sequences appear in numerous settings, including medicine (neuroscience - EEG signals, cardiology - EKG), epidemiology (covid/flu spreading over time), physics (astrophysics, material science), marketing (sales and competition modeling; market penetration), and many more. Linear differential equations will fail, since the underlying equations are often non-linear (SIR model for virus/product spread; Lotka-Volterra for product/species competition, Van der Pol for heartbeat modeling). We propose DiffFind and we use genetic algorithms to find suitable, parsimonious, differential equations. Thanks to our careful design decisions, DiffFind has the following properties - it is: (a) Effective, discovering the correct model when applied on real and synthetic nonlinear dynamical systems, (b) Explainable, gives …

Authors

Lalithsai Posam,Shubhranshu Shekhar,Meng-Chieh Lee,Christos Faloutsos

Published Date

2024/4/25
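
The DiffFind abstract above cites the SIR model as an example of a non-linear governing equation. The sketch below simulates the standard SIR equations (dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I) with forward Euler steps, to show the kind of time series such a system produces. It is not DiffFind itself and does no equation discovery. The parameter values are hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt, steps):
    """Forward-Euler simulation of the SIR model; returns a list of (S, I, R) tuples."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(steps):
        # The S*I product is the non-linear term that a linear model cannot express.
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameters chosen only for illustration.
traj = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, dt=0.1, steps=1000)
peak_infected = max(i for _s, i, _r in traj)
print(round(peak_infected, 1))
```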