We have a great new series this week from guest bloggers John B. Cook and Edwin Roehl of ADMI on “Using Data to Build Process Models.” As always, we’d love to hear your comments and feedback.
Difference between Data and Information
The power of data collection lies not in the ability to collect and store data but in the ability to extract the valuable information it contains. Many utilities are “data rich” but “information poor.” Stated differently, utilities find themselves collecting rivers of data while performing very little synthesis of it. Without data synthesis, the valuable information needed to enhance decision-making and improve processes is never made available to be converted into knowledge. However, with the newer information technologies presently available, such as real-time data collection and data mining, this need not be the case. Most utilities already have a wealth of data ready to be transformed into a rich source of information. Moreover, data collection is becoming easier as sensors become more robust and more varied.
Data mining has been defined as “the search for valuable information in large volumes of data. It is a cooperative effort of humans and computers.” The purpose of this introductory paper is twofold: to describe some of the techniques available for extracting information from data to build process models, and to demonstrate how sophisticated data mining techniques have been used to evaluate large amounts of data, enabling utility managers and other decision-makers to make prudent decisions with reduced risk. It should be noted that information theory has a strong mathematical foundation, which we shall forgo here for the sake of brevity.
What Can We Know?
Many natural environmental systems, treatment plants, and customer service operations have accumulated large databases. Given the limited resources of utilities and the growing demands placed upon them, decision-making is more critical than ever; hence, there is a need to transform these databases so as to leverage the information hidden in them. While these databases are generally underutilized, this need not be the case, as both the software and the computational power exist to determine what can be known about any process. By “process” we mean either a natural system, such as the movement of groundwater, or an engineered one, such as a water treatment plant or distribution system. But knowing how a process works requires information extracted from collected data. By itself, data is simply a record of one or more physical processes, such as a plot of temperature versus time, stream flow versus time, or average water consumption per customer per day. In contrast, information in this context tells us about critical relationships among variables, probabilities of event occurrence, and how variables evolve over time: how fast and in what direction.
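To make the distinction between data and information concrete, here is a minimal sketch in Python. It assumes a handful of made-up daily readings; the variable names, values, and the threshold of 60 are purely illustrative and do not come from any utility's records. From the raw numbers it derives the three kinds of information mentioned above: a relationship between two variables (Pearson correlation), the probability of an event (the fraction of days above a threshold), and how a variable evolves over time (day-to-day rate of change).

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def exceedance_probability(values, threshold):
    """Fraction of observations strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)


def rates_of_change(values, step=1.0):
    """Change between consecutive observations per unit time step."""
    return [(b - a) / step for a, b in zip(values, values[1:])]


# Hypothetical daily readings (illustrative values only).
temperature_c = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
demand = [50.0, 53.0, 59.0, 61.0, 67.0, 70.0]

# Relationship among variables: does demand track temperature?
print("correlation:", round(pearson(temperature_c, demand), 3))

# Probability of event occurrence: how often does demand exceed 60?
print("P(demand > 60):", exceedance_probability(demand, 60.0))

# Evolution over time: how fast, and in which direction, is demand moving?
print("daily change:", rates_of_change(demand))
```

Real data mining works on far larger and messier databases and with far more capable techniques, but the principle is the same: the lists are data, and the three printed quantities are information.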
The process of extracting information from data is called “data mining.” Data mining is a powerful analytical approach that synthesizes information from large databases, and it is used in fields as diverse as financial services, banking, customer service, biology, manufacturing and e-commerce. It is also used in environmental fields such as water and wastewater treatment, water quality modeling, hydrology and demand forecasting. Over the course of this series, we shall examine how data can be used to build process models for purposes such as:
- Minimizing disinfection by-product (DBP) formation
- Optimizing coagulation
- Demand forecasting
- Optimizing filtration performance
- Determining groundwater movement
- Developing regional total maximum daily loads (TMDLs)
Data mining techniques include advanced multivariate analysis and modeling, multi-dimensional visualization, machine learning, and artificial intelligence (AI), including artificial neural networks. Some of the questions that data mining and the resulting process models could be used to answer include:
- Why does a wastewater or water treatment plant process run better at some times than at others? At times, the answer is obvious; at other times, no logical explanation is apparent.
- What are the critical factors in ensuring that a wastewater treatment plant (WWTP) is able to nitrify or remove nutrients consistently?
- How can water use demand be accurately predicted, and what will happen if certain conservation measures are taken, if the weather changes, or if there is a change in the rate structure, population growth, etc.?
- What is the optimal operational and backwash mode for filtration?
- How can the coagulation process be optimized to remove total organic carbon (TOC)?
- What is the optimal use of limited water resources in a drainage basin?
- How can costs be reduced without adversely impacting service levels?
- How can the impacts from point source versus non-point source pollution be accurately separated?
- What can be done to reduce disinfection by-products without major capital expenditure?
- How can customer service records be used to enhance service levels?
- How can data mining techniques be used to help develop an asset management program or program for predicting critical failures?
The above examples are a small sampling of the ways databases can be used to extract critical information that helps utility managers, engineers, resource managers, operators and financial analysts make sense of what is happening on a larger scale and over a longer timeframe. While the examples differ, they have much in common. First, conclusions are not apparent at the outset of each project; that is, collecting large quantities of data and looking at simple trends yields only a limited amount of information. Second, the information needed to make decisions is extracted from large databases covering a variety of conditions. And third, the information extracted is used to build a process model and/or decision support system (DSS), which enables decisions to be made accurately and rapidly, and in a way that reflects the wisdom of sound operational history.