Data Analysis – Strategic [Urban] Ecologies, Engineering, and Decision-Making

New SEED paper reporting on use of Bayesian Networks to model water pipe breaks!

seed, May 13, 2014

We have recently had our article, "Bayesian belief networks for predicting drinking water distribution system pipe breaks," accepted for publication in Reliability Engineering and System Safety. It is now available online from the publisher.

This was one of the most rewarding papers I've written, because it allowed me to learn so much more about one of my favorite modeling techniques, the Bayesian Network. Specifically, the challenge of this paper is in learning the network from the data, instead of taking the more popular approach of assuming a network structure a priori. I am still not finished investigating the use of Bayesian Networks in infrastructure data problems, but I'm excited about this first step.

The abstract is quoted below:

In this paper, we use Bayesian Belief Networks (BBNs) to construct a knowledge model for pipe breaks in a water zone. To the authors’ knowledge, this is the first attempt to model drinking water distribution system pipe breaks using BBNs. Development of expert systems such as BBNs for analyzing drinking water distribution system data is not only important for pipe break prediction, but is also a first step in preventing water loss and water quality deterioration through the application of machine learning techniques to facilitate data-based distribution system monitoring and asset management. Due to the difficulties in collecting, preparing, and managing drinking water distribution system data, most pipe break models can be classified as “statistical-physical” or “hypothesis-generating.” We develop the BBN with the hope of contributing to the “hypothesis-generating” class of models, while demonstrating the possibility that BBNs might also be used as “statistical-physical” models. Our model is learned from pipe breaks and covariate data from a mid-Atlantic United States (U.S.) drinking water distribution system network. BBN models are learned using a constraint-based method, a score-based method, and a hybrid method. Model evaluation is based on log-likelihood scoring. Sensitivity analysis using mutual information criterion is also reported. While our results indicate general agreement with prior results reported in pipe break modeling studies, they also suggest that it may be difficult to select among model alternatives. This model uncertainty may mean that more research is needed for understanding whether additional pipe break risk factors beyond age, break history, pipe material, and pipe diameter might be important for asset management planning.
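As a rough illustration of the mutual information criterion mentioned in the abstract, the sketch below computes the mutual information between a single covariate and a break indicator from a table of counts. The counts, the variable coding, and the function name `mutual_information` are hypothetical and are not taken from the paper, and this is not necessarily how the sensitivity analysis in the paper was carried out.

```python
import math


def mutual_information(counts):
    """Mutual information in bits from a dict {(x, y): count}."""
    n = sum(counts.values())
    # Marginal counts for each variable.
    count_x, count_y = {}, {}
    for (x, y), c in counts.items():
        count_x[x] = count_x.get(x, 0) + c
        count_y[y] = count_y.get(y, 0) + c
    mi = 0.0
    for (x, y), c in counts.items():
        if c > 0:  # empty cells contribute nothing
            p_xy = c / n
            p_x = count_x[x] / n
            p_y = count_y[y] / n
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical counts: (pipe material, break observed) -> number of pipes.
material_vs_break = {
    ("cast_iron", True): 60,
    ("cast_iron", False): 340,
    ("ductile_iron", True): 15,
    ("ductile_iron", False): 585,
}

mi_material = mutual_information(material_vs_break)
print(f"I(material; break) = {mi_material:.4f} bits")
```

A value near zero would suggest the covariate carries little information about breaks; larger values suggest that knowing the covariate reduces uncertainty about the break outcome.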

SEED Group Presentation at Concordia University Today, 11AM

seed, July 6, 2012

Today, Dr. Francis is giving a talk titled "Two Studies in Using Graphical Model for Infrastructure Risk Models" discussing some recent peer-reviewed conference papers given at ICVRAM and PSAM11/ESREL12. The abstract for today's talk is:

In this talk, I will discuss the use of Bayesian Belief Networks (BBNs) and Classification and Regression Trees (CART) for infrastructure risk modeling. In the first case study, we focus on supporting risk models used to quantify economic risk due to damage to building stock attributable to hurricanes. The increasingly complex interaction between natural hazards and human activities requires more accurate data to describe the regional exposure to potential loss from physical damage to buildings and infrastructure. While databases contain information on the distribution and features of the building stock, infrastructure, transportation, etc., it is not unusual that portions of the information are missing from the available databases. Missing or low quality data compromise the validity of regional loss projections. Consequently, this paper uses Bayesian Belief Networks and Classification and Regression Trees to populate the missing information inside a database based on the structure of the available data. In the second case study, we use Bayesian Belief Networks (BBNs) to construct a knowledge model for pipe breaks in a water zone. BBN modeling is a critical step towards real-time distribution system management. Development of expert systems for analyzing real-time data is not only important for pipe break prediction, but is also a first step in preventing water loss and water quality deterioration through the application of machine learning techniques to facilitate real-time distribution system monitoring and management. Our model is based on pipe breaks and covariate data from a mid-Atlantic United States (U.S.) drinking water distribution system network. The expert model is learned using a conditional independence test method, a score-based method, and a hybrid method, then subjected to 10-fold cross validation based on log-likelihood scores.

This talk is hosted by Ketra Schmitt in the Center for Engineering in Society on the Faculty of Engineering and Computer Science.

New SEED Group Paper Presented at PSAM11/ESREL12 in Helsinki, Finland on 27 June 2012

seed, July 4, 2012

A recent SEED Group paper, "Bayesian Belief Networks for Predicting Drinking Water Distribution System Pipe Breaks" was presented at PSAM11/ESREL12 in Helsinki, Finland. This peer-reviewed conference paper was co-authored by Dr. Francis with JHU collaborators Dr. Seth Guikema and Lucas Henneman. The abstract of this paper follows:

In this project, we use Bayesian Belief Networks (BBNs) to construct a knowledge model for pipe breaks in a water zone. BBN modeling is a critical step towards real-time distribution system management. Development of expert systems for analyzing real-time data is not only important for pipe break prediction, but is also a first step in preventing water loss and water quality deterioration through the application of machine learning techniques to facilitate real-time distribution system monitoring and management. Our model will be based on pipe breaks and covariate data from a mid-Atlantic United States (U.S.) drinking water distribution system network. The expert model is learned using a conditional independence test method, a score-based method, and a hybrid method, then subjected to 10-fold cross validation based on log-likelihood scores.

A short report from PSAM11/ESREL12 will follow in a later post.

What’s the best way to learn Bayesian Networks from data?

seed, September 24, 2011

Bayesian networks are remarkable graphical models for organizing joint probability distributions according to the conditional independence relationships extant in a dataset. Another way of saying this is that people usually process information according to hypotheses linking the objects they observe. People don’t connect objects intellectually unless the hypotheses connect. One of my PhD advisors at Carnegie Mellon, Mitchell Small, introduced me to the Bayesian network as a way to combine information from health effects studies to support risk assessment, but as a postdoctoral fellow, I started to look at Bayesian networks as a potential data mining tool. However you look at them, they are elegant computational models with a compelling axiomatic basis for philosophical reasoning, to boot. When I was a graduate student, they helped me understand and visualize Bayes’ rule, and now I’m hoping to use them more as a data mining technique to model drinking water distribution system reliability. For these applications, I think that learning the networks from my datasets will be indispensable.
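To make the factorization idea concrete, here is a minimal sketch of a three-node network in Python. The structure (pipe age and pipe material both pointing to a break node), the variable names, and every probability are made up for illustration only. The joint distribution is written as a product of one small table per node, and Bayes' rule is applied by summing that joint over the unobserved variable.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_OLD = {True: 0.3, False: 0.7}    # P(pipe is old)
P_IRON = {True: 0.4, False: 0.6}   # P(pipe is cast iron)
P_BREAK = {                        # P(break = True | old, iron)
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.08,
    (False, False): 0.02,
}


def joint(old, iron, brk):
    """P(old, iron, break) = P(old) * P(iron) * P(break | old, iron)."""
    p_break = P_BREAK[(old, iron)]
    return P_OLD[old] * P_IRON[iron] * (p_break if brk else 1.0 - p_break)


def posterior_old_given_break():
    """P(old = True | break = True), by enumerating the joint."""
    numerator = sum(joint(True, iron, True) for iron in (True, False))
    evidence = sum(
        joint(old, iron, True)
        for old, iron in product((True, False), repeat=2)
    )
    return numerator / evidence


total = sum(joint(o, i, b) for o, i, b in product((True, False), repeat=3))
print(f"joint sums to {total:.6f}")
print(f"P(old) = {P_OLD[True]:.3f}, P(old | break) = {posterior_old_given_break():.3f}")
```

Because age and material have no edge between them, only three small tables are needed instead of one table over all eight joint outcomes. That economy is what the conditional independence structure buys.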

OK, so learning Bayesian networks hasn’t been the exclusive focus of my thoughts in preparing this research. Most of the past month has gone to a more thorough reading of the first few parts of Judea Pearl’s Probabilistic Reasoning in Intelligent Systems, but I have also found a really cool paper by an Italian geneticist, Marco Scutari, who has integrated several of the most popular network learning algorithms into an R package (bnlearn) for learning both the structure and the parameters of a Bayesian network. I originally came across his article last March or so when working with some JHU colleagues on using Bayesian networks to predict missing data in a public hurricane loss model database, but we didn’t learn our network from data, and we made some simplifying assumptions that did not require the sophisticated set of techniques in the linked paper. Having read Scutari’s paper several more times in the past week, I’m very impressed by the resource that he’s constructed. It also helps a lot that it is in my favorite programming environment. There are many tools that learn either the structure or the parameters of a Bayesian network, but few attempt to do both at the same time. While Scutari’s package doesn’t do both at the same time either, a researcher can come close, especially when using the bootstrapping or cross-validation utilities included in bnlearn.
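The general idea of comparing candidate structures by cross-validated log-likelihood can be sketched in a few lines. The example below is a toy written in plain Python, not bnlearn and not its algorithms: the synthetic data, the two candidate structures (an edge from A to B versus no edge), the smoothing constant `alpha`, and all function names are assumptions made for illustration. For each fold, parameters are fitted on the training rows and the held-out rows are scored.

```python
import math
import random


def make_data(n, seed=0):
    """Synthetic binary data in which B really does depend on A."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.random() < 0.4
        b = rng.random() < (0.8 if a else 0.1)
        rows.append((a, b))
    return rows


def fit(rows, b_depends_on_a, alpha=1.0):
    """Estimate smoothed probability tables for one candidate structure."""
    n = len(rows)
    p_a = (sum(a for a, _ in rows) + alpha) / (n + 2 * alpha)
    if b_depends_on_a:
        p_b = {}
        for a_val in (True, False):
            subset = [b for a, b in rows if a == a_val]
            p_b[a_val] = (sum(subset) + alpha) / (len(subset) + 2 * alpha)
    else:
        p = (sum(b for _, b in rows) + alpha) / (n + 2 * alpha)
        p_b = {True: p, False: p}  # same P(B) whatever A is
    return p_a, p_b


def log_likelihood(rows, params):
    """Log-likelihood of rows under fitted parameters."""
    p_a, p_b = params
    total = 0.0
    for a, b in rows:
        total += math.log(p_a if a else 1.0 - p_a)
        total += math.log(p_b[a] if b else 1.0 - p_b[a])
    return total


def cv_score(rows, b_depends_on_a, k=10):
    """Mean held-out log-likelihood per row over k folds."""
    folds = [rows[i::k] for i in range(k)]
    score = 0.0
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        score += log_likelihood(held_out, fit(train, b_depends_on_a))
    return score / len(rows)


data = make_data(2000)
score_edge = cv_score(data, b_depends_on_a=True)
score_empty = cv_score(data, b_depends_on_a=False)
print(f"A -> B: {score_edge:.4f}   no edge: {score_empty:.4f}")
```

With real data the gap between candidate structures can be much smaller than in this toy, which is where resampling utilities such as bootstrapping and cross-validation become useful for judging how stable a learned structure is.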

Because of Scutari and bnlearn, I am excited to move further with the modeling I’m doing. As an environmental engineer who wants to use computer science, not necessarily create it, I’m very pleased he’s made this tool available.

Does Big Data make the scientific method obsolete? I hope not?

seed, July 26, 2011

I came across a link posted by @urbandata on Twitter asking the question, "Does 'big data' make the scientific method obsolete?" My immediate response before clicking the link was, "I sure hope not." After reading the article, I think it may be a bit more complex than that, but stand by my original impression.

The article "The end of theory: The data deluge makes the scientific method obsolete" can be found here: "Does big data make the scientific method obsolete?"

I think the author, Chris Anderson, rightly points out that correlation must not be confused with causation, but he continues without exploring the full meaning of this statement. As a result, and without intending to, he builds an argument that rests on the wisdom of this traditional warning.

For example, Anderson cites Craig Venter's successful "shotgun sequencing" approach to DNA sequencing, yet doesn't acknowledge that this approach is valid because of the established theory that species are uniquely identified by their genomes. More than that, the same theory lends credence to the author's later observation that organisms don't need to be directly observed to learn about their characteristics. The author can make this claim for the same mechanistic reason the shotgun approach works.

This is not to say that the use of statistical and mathematical models to analyze ubiquitous data around us does not extend the scientific method in ways that we don't yet imagine. It does. However, science provides not only the foundation for the mathematical theories underlying statistical methods, but it also helps us to interpret the data streams and statistical results. Yes, we should strive to change the way science works, but we should not abdicate responsibility for inquiry and investigation to the black box.

[This post also appears on my personal blog, the fertile paradox...]