
We have recently had our article, "Bayesian belief networks for predicting drinking water distribution system pipe breaks," accepted for publication in Reliability Engineering and System Safety. It is now available online from the publisher.

This was one of the most rewarding papers I've written, because it allowed me to learn so much more about one of my favorite modeling techniques, the Bayesian Network. Specifically, the challenge of this paper is in learning the network from the data, instead of taking the more popular approach of assuming a network structure a priori. I am still not finished investigating the use of Bayesian Networks in infrastructure data problems, but I'm excited about this first step.

The abstract is quoted below:

In this paper, we use Bayesian Belief Networks (BBNs) to construct a knowledge model for pipe breaks in a water zone. To the authors’ knowledge, this is the first attempt to model drinking water distribution system pipe breaks using BBNs. Development of expert systems such as BBNs for analyzing drinking water distribution system data is not only important for pipe break prediction, but is also a first step in preventing water loss and water quality deterioration through the application of machine learning techniques to facilitate data-based distribution system monitoring and asset management. Due to the difficulties in collecting, preparing, and managing drinking water distribution system data, most pipe break models can be classified as “statistical-physical” or “hypothesis-generating.” We develop the BBN with the hope of contributing to the “hypothesis-generating” class of models, while demonstrating the possibility that BBNs might also be used as “statistical-physical” models. Our model is learned from pipe breaks and covariate data from a mid-Atlantic United States (U.S.) drinking water distribution system network. BBN models are learned using a constraint-based method, a score-based method, and a hybrid method. Model evaluation is based on log-likelihood scoring. Sensitivity analysis using mutual information criterion is also reported. While our results indicate general agreement with prior results reported in pipe break modeling studies, they also suggest that it may be difficult to select among model alternatives. This model uncertainty may mean that more research is needed for understanding whether additional pipe break risk factors beyond age, break history, pipe material, and pipe diameter might be important for asset management planning.

Today, Dr. Francis is giving a talk titled "Two Studies in Using Graphical Model for Infrastructure Risk Models" discussing some recent peer-reviewed conference papers given at ICVRAM and PSAM11/ESREL12.  The abstract for today's talk is:

In this talk, I will discuss the use of Bayesian Belief Networks (BBNs) and Classification and Regression Trees (CART) for infrastructure risk modeling.  In the first case study, we focus on supporting risk models used to quantify economic risk due to damage to building stock attributable to hurricanes. The increasingly complex interaction between natural hazards and human activities requires more accurate data to describe the regional exposure to potential loss from physical damage to buildings and infrastructure. While databases contain information on the distribution and features of the building stock, infrastructure, transportation, etc., it is not unusual that portions of the information are missing from the available databases. Missing or low quality data compromise the validity of regional loss projections. Consequently, this paper uses Bayesian Belief Networks and Classification and Regression Trees to populate the missing information inside a database based on the structure of the available data. In the second case study, we use Bayesian Belief Networks (BBNs) to construct a knowledge model for pipe breaks in a water zone.  BBN modeling is a critical step towards real-time distribution system management.  Development of expert systems for analyzing real-time data is not only important for pipe break prediction, but is also a first step in preventing water loss and water quality deterioration through the application of machine learning techniques to facilitate real-time distribution system monitoring and management.  Our model is based on pipe breaks and covariate data from a mid-Atlantic United States (U.S.) drinking water distribution system network. The expert model is learned using a conditional independence test method, a score-based method, and a hybrid method, then subjected to 10-fold cross validation based on log-likelihood scores.

This talk is hosted by Ketra Schmitt in the Center for Engineering in Society on the Faculty of Engineering and Computer Science.

A recent SEED Group paper, "Bayesian Belief Networks for Predicting Drinking Water Distribution System Pipe Breaks," was presented at PSAM11/ESREL12 in Helsinki, Finland. This peer-reviewed conference paper was co-authored by Dr. Francis with JHU collaborators Dr. Seth Guikema and Lucas Henneman. The abstract of this paper follows:

In this project, we use Bayesian Belief Networks (BBNs) to construct a knowledge model for pipe breaks in a water zone.  BBN modeling is a critical step towards real-time distribution system management.  Development of expert systems for analyzing real-time data is not only important for pipe break prediction, but is also a first step in preventing water loss and water quality deterioration through the application of machine learning techniques to facilitate real-time distribution system monitoring and management.  Our model will be based on pipe breaks and covariate data from a mid-Atlantic United States (U.S.) drinking water distribution system network. The expert model is learned using a conditional independence test method, a score-based method, and a hybrid method, then subjected to 10-fold cross validation based on log-likelihood scores.

A short report from PSAM11/ESREL12 will follow in a later post.

In the world of chemical or human health risk analysis there seem to be several clouds forming on the horizon: mixture-based toxicology and interpretation, data-poor extrapolation to human exposure, and high-dose chronic to sub-chronic low-dose dose-response extrapolation. These challenges force us to approach risk analysis as an art, and necessitate the inclusion of decision analysis in chemical screening procedures.

One problem whose urgency is increasing is data-poor extrapolation from animal to human dose-response relationships.  Not only are there tens of thousands of compounds that are not regulated and have no publicly available data, but there are also entirely new types of chemicals produced by technological innovation for which existing toxicological approaches may not be appropriate.

Traditionally, risk scientists make this approximation (and others like it) by proposing a reference dose. The reference dose (RfD) is an unenforceable standard postulating a daily oral human exposure at which no appreciable risk of adverse effects attributable to the given compound is likely to exist. The reference dose is obtained from a point of departure, either the lowest dose producing effects or the highest dose for which no effects have been observed (i.e., the LOAEL or NOAEL), which is then divided by uncertainty factors reflecting the uncertainties introduced by extrapolation between species and data quality contexts. Roger Cooke and several commentators discuss the RfD, concluding that the approach needs to be updated to incorporate a probabilistic interpretation of these uncertainties, but there seems to be disagreement on how to update the RfD. In his Risk Analysis article, “Conundrums with Uncertainty Factors,” Cooke argues that this approach not only relies on inappropriate statistical independence assumptions, but is also analogous to the engineering design application of safety factors. By not employing a probabilistic approach, we promulgate uneconomic guidelines at best, while at worst we are overconfident in our risk mitigation.
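To make the arithmetic concrete, here is a minimal sketch of the uncertainty factor calculation described above. The function name, the factor labels, and every number are hypothetical values chosen for illustration; they do not come from Cooke's article or from any regulatory assessment.

```python
from math import prod


def reference_dose(point_of_departure, uncertainty_factors):
    """Divide a point of departure (a NOAEL or LOAEL, in mg/kg-day)
    by the product of the uncertainty factors."""
    factors = list(uncertainty_factors)
    if point_of_departure <= 0:
        raise ValueError("point of departure must be positive")
    if any(uf < 1 for uf in factors):
        raise ValueError("uncertainty factors are expected to be >= 1")
    return point_of_departure / prod(factors)


# Hypothetical inputs, for illustration only.
noael = 50.0  # mg/kg-day, assumed animal-study NOAEL
factors = {
    "animal_to_human": 10.0,    # interspecies extrapolation
    "human_variability": 10.0,  # sensitive subpopulations
    "data_quality": 3.0,        # gaps in the available studies
}

rfd = reference_dose(noael, factors.values())
print(f"RfD = {rfd:.4f} mg/kg-day")
```

The factors are simply multiplied together, so each uncertainty is treated as a fixed deterministic margin, much like an engineering safety factor. That multiplication is the step a probabilistic treatment would replace.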

Cooke’s paper illustrates a probabilistic approach to obtaining estimates of dose-response relations from combined animal, human, data-poor, and data-rich results in a chemical toxicity knowledge base founded on Bayesian Belief Networks (in his example, non-parametric, continuous BBNs). He demonstrates the possibility of employing nonparametric or generalizable statistical methods to obtain a probabilistic understanding of the response of interest in the context of the chemical’s toxicological knowledge base. This is in contrast to the uncertainty factor approach, which presupposes that we can hope to obtain only a limited understanding of the dose-response relationship at relevant human exposures. While we are a long way from abandoning the RfD approach, Cooke acknowledges that it may be difficult to rely only on dose-response modeling. His approach starts from current practice, while promising a rapid and simple inference mechanism capable of deriving toxicological indicators and amenable to inclusion in broader decision-making models.

Bayesian networks are remarkable graphical models for organizing joint probability distributions according to the conditional independence relationships extant in a dataset. Another way of saying this is that people usually process information according to hypotheses linking the objects they observe. People don’t connect objects intellectually unless the hypotheses connect. One of my PhD advisors at Carnegie Mellon, Mitchell Small, introduced me to the Bayesian Network as a way to combine information from health effects studies to support risk assessment, but as a postdoctoral fellow, I started to look at Bayesian networks as a potential data mining tool. However you look at them, they are elegant computational models with a compelling axiomatic basis for philosophical reasoning, to boot. They helped me understand and visualize Bayes’ rule as a graduate student, and now I’m hoping to use them more as a data mining technique to model drinking water distribution system reliability. For these applications, I think that learning the networks from my datasets will be indispensable.
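As a small illustration of what "organizing a joint distribution by conditional independence" means, the sketch below builds a three-node network by hand and applies Bayes' rule by enumeration. The structure (pipe age and pipe material both pointing to break) and all of the probabilities are made up for the example; they are not results from any of my datasets.

```python
from itertools import product

# Hypothetical network: old -> break <- cast_iron.
# "old" and "cast_iron" have no parents, so they are marginally independent.
p_old = {True: 0.3, False: 0.7}
p_cast_iron = {True: 0.4, False: 0.6}

# P(break = True | old, cast_iron), one entry per parent configuration.
p_break = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.08,
    (False, False): 0.02,
}


def joint(old, cast_iron, brk):
    """Joint probability as the product of each node given its parents."""
    p_b = p_break[(old, cast_iron)]
    return p_old[old] * p_cast_iron[cast_iron] * (p_b if brk else 1.0 - p_b)


# The factorization defines a proper distribution: all 8 cells sum to 1.
total = sum(joint(o, c, b) for o, c, b in product((True, False), repeat=3))

# Bayes' rule by enumeration: P(old | break) = P(old, break) / P(break).
p_old_and_break = sum(joint(True, c, True) for c in (True, False))
p_break_marginal = sum(
    joint(o, c, True) for o, c in product((True, False), repeat=2)
)
p_old_given_break = p_old_and_break / p_break_marginal

print(f"sum of joint = {total:.3f}")
print(f"P(old | break) = {p_old_given_break:.3f}")
```

The full joint table has eight cells, but the network only needs six numbers to specify it. With these made-up values, observing a break raises the probability that the pipe is old from the prior of 0.3 to roughly 0.58.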

OK, so learning Bayesian networks hasn’t been the exclusive focus of my thoughts in preparing this research. Most of the past month has been a more thorough reading of the first few parts of Judea Pearl’s Probabilistic Reasoning in Intelligent Systems, but I have found a really cool paper by an Italian geneticist who has integrated several of the most popular network learning algorithms into an R package (bnlearn) for learning both the structure and parameters of a Bayesian network. I originally came across his article last March or so when working with some JHU colleagues on using Bayesian Networks to predict missing data in a public hurricane loss model database, but we didn’t learn our network from data, and made some simplifying assumptions that did not require the sophisticated set of techniques in the linked paper. Having read Marco Scutari’s paper several more times in the past week, I’m very impressed by the resource that he’s constructed. It also helps a lot that it is in my favorite programming environment. There are many tools that learn either the structure or the parameters of Bayesian networks, but doing both at the same time has generally been left alone. While Scutari’s package doesn’t do both at the same time, a researcher can come close, especially when using the bootstrapping or cross-validation utilities included in bnlearn.

Because of Scutari and bnlearn, I am excited to move further with the modeling I’m doing.  As an environmental engineer who wants to use computer science, not necessarily create it, I’m very pleased he’s made this tool available.
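For readers who want to see the idea behind comparing candidate structures by cross-validated log-likelihood, here is a toy sketch in plain Python. It is not bnlearn and does not use its API. The two-variable structures, the synthetic data, the Laplace smoothing, and all function names are assumptions made only for illustration.

```python
import math
import random


def fit_independent(train):
    """Structure with no arc: P(a, b) = P(a) * P(b), Laplace-smoothed."""
    n = len(train)
    pa = (sum(a for a, _ in train) + 1) / (n + 2)
    pb = (sum(b for _, b in train) + 1) / (n + 2)
    return lambda a, b: (pa if a else 1 - pa) * (pb if b else 1 - pb)


def fit_dependent(train):
    """Structure with arc a -> b: P(a, b) = P(a) * P(b | a), Laplace-smoothed."""
    n = len(train)
    pa = (sum(a for a, _ in train) + 1) / (n + 2)
    pb_given_a = {}
    for a_val in (0, 1):
        b_rows = [b for a, b in train if a == a_val]
        pb_given_a[a_val] = (sum(b_rows) + 1) / (len(b_rows) + 2)

    def prob(a, b):
        p_b = pb_given_a[a]
        return (pa if a else 1 - pa) * (p_b if b else 1 - p_b)

    return prob


def cv_log_likelihood(data, fit, k=10):
    """Sum of held-out log-likelihoods over k folds."""
    folds = [data[i::k] for i in range(k)]
    total = 0.0
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        prob = fit(train)  # parameters are learned on the training folds only
        total += sum(math.log(prob(a, b)) for a, b in held_out)
    return total


# Synthetic data in which b really does depend on a (hypothetical rates).
rng = random.Random(0)
data = []
for _ in range(500):
    a = 1 if rng.random() < 0.3 else 0
    b = 1 if rng.random() < (0.6 if a else 0.1) else 0
    data.append((a, b))

ll_independent = cv_log_likelihood(data, fit_independent)
ll_dependent = cv_log_likelihood(data, fit_dependent)
print(f"no arc:  {ll_independent:.1f}")
print(f"a -> b:  {ll_dependent:.1f}")
```

The structure with the higher (less negative) held-out log-likelihood is preferred. In a real analysis, the candidate structures would come from a structure learning algorithm rather than being written by hand, and the networks would have many more nodes.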