Invited speakers at EDBT/ICDT 2011




TITLE: On Provenance and Privacy


SPEAKER: Susan B. Davidson, University of Pennsylvania, USA
ABSTRACT:
Provenance in scientific workflows is a double-edged sword.  On the
one hand, recording information about the module executions used to
produce a data item, as well as the parameter settings and
intermediate data items passed between module executions, enables
transparency and reproducibility of results.  On the other hand, a
scientific workflow often uses proprietary modules and contains
private or confidential data, such as medical or financial
information. Hence, providing exact answers to provenance queries over
all executions of the workflow may reveal private information. In this
talk we discuss potential privacy concerns in scientific workflows --
module privacy, data privacy, and structural privacy -- and frame
several natural questions: (i) Can we formally analyze module, data or
structural privacy, giving provable privacy guarantees for an
unlimited/bounded number of provenance queries? (ii) How can we answer
provenance queries, providing as much information as possible to the
user while still guaranteeing the required privacy? Then we look at
module privacy in detail and propose a formal model. Finally, we point
to several directions for future work.

BIO:
Susan B. Davidson received the B.A. degree in Mathematics from Cornell
University, Ithaca, NY, in 1978, and the M.A. and Ph.D. degrees in
Electrical Engineering and Computer Science from Princeton University,
Princeton, NJ, in 1980 and 1982, respectively.  Dr. Davidson is the Weiss Professor
and Chair of Computer and Information Science at the University of
Pennsylvania, where she has been since 1982.  She also served as
Deputy Dean of the School of Engineering and Applied Science from
2005-2007.

Dr. Davidson's research interests include database and web-based
systems, and bioinformatics.  Within bioinformatics she is best known
for her work with the Kleisli data integration system (joint work with
Drs. Buneman, Tannen and Overton), which was subsequently
commercialized in the company GeneticXChange.  Her most recent work
has centered on provenance in scientific workflow systems.

Dr. Davidson was the founding co-director of the Penn Center for
Bioinformatics from 1997-2003, and the founding co-leader of the
Greater Philadelphia Bioinformatics Alliance. She holds a secondary
appointment in the Department of Genetics, is an ACM Fellow, received
the Lenore Rowe Williams Award (2002), and was a Fulbright Scholar and
recipient of a Hitachi Chair (2004).



TITLE: The PADS Project: An Overview


SPEAKER: Kathleen Fisher, Tufts University, USA
ABSTRACT:
The goal of the PADS project, which started in 2001, is to make it
easier for data analysts to extract useful information from ad hoc
data files.  This paper does not report new results, but rather gives
an overview of the project and how it helps bridge the gap between the
unmanaged world of ad hoc data and the managed world of typed
programming languages and databases.  In particular, the paper reviews
the design of PADS data description languages, describes the generated
parsing tools, and discusses the importance of meta-data.  It also
sketches the formal semantics, discusses useful tools and how they can
be generated automatically from PADS descriptions, and describes an
inferencing system that can learn useful PADS descriptions from
positive examples of the data format.
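The abstract above describes generating parsers from declarative data
descriptions.  A minimal sketch of that idea in Python (illustrative
only: this is not actual PADS syntax, and the field names and sample
record are made up):

```python
# Toy data description in the spirit of PADS (not real PADS syntax):
# each field of a whitespace-separated record gets a name and a type.
description = [("host", str), ("status", int), ("bytes", int)]

def make_parser(description):
    """'Generate' a parser from the declarative description."""
    def parse(line):
        tokens = line.split()
        if len(tokens) != len(description):
            raise ValueError("line does not match description")
        return {name: typ(tok)
                for (name, typ), tok in zip(description, tokens)}
    return parse

parse = make_parser(description)
print(parse("example.com 200 5120"))
# -> {'host': 'example.com', 'status': 200, 'bytes': 5120}
```

The point is that the description is data, so other tools (printers,
validators, statistical profilers) can be generated from the same
source, which is the role meta-data plays in the paper.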

BIO:
Kathleen Fisher is Professor in the Computer Science Department at
Tufts.  Previously, she was a Principal Member of the Technical Staff
at AT&T Labs Research and a Consulting Faculty Member in the Computer
Science Department at Stanford University.  Kathleen's research
focuses on advancing the theory and practice of programming languages
and on applying ideas from the programming language community to the
problem of ad hoc data management.  The main thrust of her work has
been in domain-specific languages to facilitate programming with
massive amounts of ad hoc data, including the Hancock system for
efficiently building signatures from massive transaction streams and
the PADS system for managing ad hoc data.  She has served as program
chair for FOOL, ICFP, CUFP, and OOPSLA, and she is an ACM
Fellow. Kathleen is past Chair of the ACM Special Interest Group in
Programming Languages (SIGPLAN), Co-Chair of CRA's Committee on the
Status of Women (CRA-W), and an editor of the Journal of Functional
Programming.



TITLE:  Tractability in Probabilistic Databases

SPEAKER: Dan Suciu, University of Washington, USA

ABSTRACT:
A major challenge in data management is how to manage uncertain data. There are many reasons for the uncertainty: the data may be extracted
automatically from text, it may be derived from the physical world such as RFID data, it may be integrated using fuzzy matches, or may be the result of complex stochastic models.  Whatever the reason for the uncertainty, a data management system needs to offer predictable performance to queries over large instances of uncertain data.

In this talk I will address a fundamental computational problem in probabilistic databases: given a query, what is the complexity of
evaluating it over probabilistic databases?  Probabilistic inference
is known to be hard in general, but once we fix a query, it becomes a
specialized problem.  I will show that Unions of Conjunctive Queries
(also known as non-recursive datalog rules) admit a dichotomy: every
query is either provably #P hard, or can be evaluated in PTIME.  For
practical purposes, the most interesting part of this dichotomy is the
PTIME algorithm.  It uses in a fundamental way the Möbius inversion
formula on finite lattices (which is the inclusion-exclusion formula
plus term cancellation), and, because of that, it can perform probabilistic inference in PTIME on classes of Boolean expressions
where other established methods fail, including OBDDs, FBDDs,
inference based on bounded tree widths, or d-DNNF's.
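As a concrete illustration of the inclusion-exclusion computation the
abstract refers to, here is a brute-force sketch over a hypothetical
tuple-independent database (tuple names and probabilities invented for
the example).  This naive version is exponential in the number of
conjuncts; the talk's point is that Möbius-style term cancellation
makes the computation PTIME for the tractable queries:

```python
from itertools import combinations

# Hypothetical tuple-independent database: each tuple is present
# independently with the given probability.
prob = {"r1": 0.9, "r2": 0.5, "s1": 0.8}

# Lineage of each conjunct of a union of conjunctive queries:
# the set of tuples that must all be present for it to hold.
lineages = [{"r1", "s1"}, {"r2", "s1"}]

def p_all(tuples):
    """Probability that every tuple in the set is present."""
    p = 1.0
    for t in tuples:
        p *= prob[t]
    return p

def p_union(lineages):
    """Inclusion-exclusion over the conjuncts of the union."""
    total = 0.0
    for k in range(1, len(lineages) + 1):
        for subset in combinations(lineages, k):
            total += (-1) ** (k + 1) * p_all(set().union(*subset))
    return total

print(round(p_union(lineages), 2))  # -> 0.76
```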

BIO:
Dan Suciu is a Professor in Computer Science at the University of
Washington. He received his Ph.D. from the University of Pennsylvania
in 1995, then was a principal member of the technical staff at AT&T
Labs until he joined the University of Washington in 2000.  Suciu is
conducting research in data management, with an emphasis on topics
that arise from sharing data on the Internet, such as management of
semistructured and heterogeneous data, data security, and managing
data with uncertainties. He is a co-author of the book Data on the
Web: from Relations to Semistructured Data and XML, holds twelve US
patents, and received the 2000 ACM SIGMOD Best Paper Award and the
2010 PODS Ten-Year Test of Time Award. He is a recipient of the NSF
CAREER Award and of an Alfred P. Sloan Fellowship.



TITLE: Map-Reduce Extensions and Recursive Queries

SPEAKER: Jeff Ullman, Stanford University, USA
ABSTRACT:
We survey the recent wave of extensions to the popular map-reduce systems, including those that have begun to address the implementation
of recursive queries using the same computing environment as map-reduce. A central problem is that recursive tasks cannot deliver
their output only at the end, which makes recovery from failures much
more complicated than in map-reduce and its nonrecursive extensions.
We propose several algorithmic ideas for efficient implementation of
recursions in the map-reduce environment and discuss several
alternatives for supporting recovery from failures without restarting
the entire job.
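The kind of recursion at issue is exemplified by transitive closure,
where new output is produced on every round rather than only at the
end.  A semi-naive sketch in plain Python over a hypothetical toy
graph (no distribution or failure handling, just the iteration
structure):

```python
# Hypothetical toy graph; each pair is a directed edge.
edges = {("a", "b"), ("b", "c"), ("c", "d")}

def transitive_closure(edges):
    """Semi-naive evaluation: each round joins only the newly
    discovered paths (delta) against the base edges."""
    reach = set(edges)
    delta = set(edges)
    while delta:
        step = {(x, z)
                for (x, y) in delta
                for (y2, z) in edges if y == y2}
        delta = step - reach   # facts first derived this round
        reach |= delta         # output accumulates across rounds
    return reach

print(sorted(transitive_closure(edges)))
# -> [('a', 'b'), ('a', 'c'), ('a', 'd'),
#     ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

Because `reach` grows across iterations, a failed worker cannot simply
be restarted from scratch without redoing earlier rounds, which is the
recovery difficulty the abstract highlights.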

BIO:
Jeff Ullman is the Stanford W. Ascherman Professor of Engineering
(Emeritus) in the Department of Computer Science at Stanford and CEO
of Gradiance Corp.  He received the B.S. degree from Columbia
University in 1963 and the PhD from Princeton in 1966.  Prior to his
appointment at Stanford in 1979, he was a member of the technical
staff of Bell Laboratories from 1966-1969, and on the faculty of
Princeton University between 1969 and 1979.  From 1990-1994, he was
chair of the Stanford Computer Science Department.  He has served as
chair of the CS-GRE Examination board, Member of the ACM Council,
Chair of the New York State CS Doctoral Evaluation Board, on several
NSF advisory boards, and is past or present editor of several
journals.

Ullman was elected to the National Academy of Engineering in 1989 and
has held Guggenheim and Einstein Fellowships.  He has received the
Sigmod Contributions Award (1996), the ACM Karl V. Karlstrom
Outstanding Educator Award (1998), the Knuth Prize (2000), the Sigmod
E. F. Codd Innovations award (2006), and the IEEE von Neumann medal
(2010).  He is the author of 16 books, including widely read books on
database systems, compilers, automata theory, and algorithms.



TITLE: Database Researchers: Plumbers or Thinkers?

SPEAKER: Gerhard Weikum,  Max-Planck Institute for Informatics, Germany
ABSTRACT:
DB researchers have traditionally focused on engine-centered issues
such as indexing, query processing, and transactions.  Data mining has
broadened the community's viewpoint towards algorithmic and
statistical issues.  However, DB research has always had a tendency to
shy away from seemingly elusive long-term challenges with AI flavor.
On the other hand, the current explosion of digital content in
enterprises and on the Internet is mostly driven by user-created
information like text, tags, photos, and videos, and not by more
well-designed databases of the traditional kind.

In this situation, I question the traditional skepticism of DB
researchers towards "AI-complete" problems and the DB community's
reluctance to embark on seemingly non-DB-ish grand challenges. Big
questions that I see as great opportunities also for DB research
include: 1) automatic extraction of relational facts from
natural-language text and multimodal contexts, 2) automatic
disambiguation of named-entity mentions and general phrases in text
and speech, 3) large-scale gathering of factual-knowledge candidates
and their reconciliation into comprehensive knowledge bases, 4)
reasoning on uncertain hypotheses, for knowledge discovery and
semantic search, 5) deep and real-time question answering, e.g., to
enable computers to win quiz game shows, 6)
machine-reading of scientific publications and fictional literature,
to enable corpus-wide analyses and to help researchers in science and
humanities develop hypotheses and quickly focus on the most relevant
issues.

I believe that successfully tackling these topics requires efficient
data-centric algorithms, scalable methods and architectures, and
system-level thinking -- virtues that are richly available in the DB
research community. Moreover, I would encourage our community to look
across the fence and get more engaged on the exciting challenges
outside the traditionally narrow boundaries of the DB realm.  I will
illustrate these points by examples from my own research on knowledge
management.  Breakthroughs will require long-term stamina. In the
meantime, steady incremental progress is better than not embarking on
these important problems at all.

BIO:
Gerhard Weikum is a Research Director at the Max-Planck Institute for
Informatics (MPII) in Saarbruecken, Germany, where he is leading the
department on databases and information systems.  He is also an
adjunct professor in the Department of Computer Science of Saarland
University in Saarbruecken, Germany, and he is a principal
investigator of the Cluster of Excellence on Multimodal Computing and
Interaction.  Earlier he held positions at Saarland University in
Saarbruecken, Germany, at ETH Zurich, Switzerland, at MCC in Austin,
Texas, and he was a visiting senior researcher at Microsoft Research
in Redmond, Washington.  He received his diploma and doctoral degrees
from the University of Darmstadt, Germany.  Gerhard Weikum is an ACM
Fellow, a Fellow of the German Computer Society, and a member of the
German Academy of Science and Engineering. He has served on various
editorial boards, including Communications of the ACM, and as program
committee chair of conferences like ACM SIGMOD, IEEE Data Engineering,
and CIDR.  From 2003 through 2009 he was president of the VLDB
Endowment.