This is the second installment by John Cook and Edwin A. Rohl, Jr. of Advanced Data Mining International. Their water quality research has enormous implications for water security and event detection, improving the knowledge of operators and ultimately benefiting public health. I hope you enjoy!
“Spot” sampling is a common practice in the water industry. The frequency of spot sampling can be based on: regulatory requirements, such as weekly and monthly total coliform and quarterly disinfection byproducts reporting; a matter of convenience, such as once per shift; or it can be more arbitrary. It is easy to imagine that problem conditions such as a very low disinfectant concentration or an overdosing could be missed by a spot sampling regime that is insufficiently frequent. SCADA systems that poll online instruments at adjustable frequencies can readily reveal problems that are evident for tens of minutes or hours. However, a problem that might be apparent for only a few minutes, such as a contaminated slug traveling past sensors in a pipe, would be undetectable at a 5-minute polling interval. Therefore, to effectively monitor a process, utility personnel must answer questions about the types of problems they want to detect: what types of sensors are needed, where should they be placed, and what sampling frequency is needed?
The sampling frequency can be estimated by evaluating time series data directly. A starting point is provided by the Nyquist-Shannon sampling theorem, which dates from the 1940s. Harry Nyquist was a contributor to the field of “information theory”, whose establishment is credited to Claude Shannon of Bell Labs. The theory provides much of the foundation for the fields of signal processing and computing. The theorem implies that the sampling interval needed to detect a behavior must be less than half the behavior’s duration. For example, above in Figure 1 is a chlorine residual signal that was polled at 1-minute intervals. A number of “features” have been labeled in the detail below. According to the sampling theorem, the 2-minute feature would require a sampling interval shorter than the 1-minute polling interval, while the maximum sampling intervals for the 23-minute, 75-minute (1.25 hr), and 90-minute (1.5 hr) features would be 11, 37, and 44 minutes, respectively, when rounded down to the nearest minute. A polling interval of, say, 30 seconds would be needed to make them all detectable. These features are so visually discernible because of large trend shifts that are not always present in process signals. An automated detection scheme would leverage multiple ways of characterizing signals.
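As a rough sketch of the arithmetic above (assuming whole-minute polling intervals and the strict “less than half the duration” reading of the theorem), the maximum usable interval for each labeled feature can be computed like this:

```python
import math

def max_sampling_interval(duration_min):
    """Largest whole-minute sampling interval that is still
    strictly less than half the feature's duration, per the
    Nyquist-Shannon sampling theorem."""
    return math.ceil(duration_min / 2) - 1

# Feature durations (minutes) from the labeled detail in Figure 1.
for duration in (23, 75, 90):
    print(duration, "->", max_sampling_interval(duration), "minutes")
```

For the 2-minute feature the same rule yields 0 whole minutes, which is why a sub-minute interval such as 30 seconds is needed to capture every labeled feature.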
The sampling theorem holds strictly only for idealized cases and is most applicable to signals with limited noise. The field of chaotic time series analysis offers an approach that better accommodates noisy signals. The approach suggests that the sampling interval be the “time delay” td at which a signal becomes statistically independent of a copy of itself (no longer auto-correlated). Put simply, each new sample then carries unique information that is separable from the accompanying noise. As shown in Figure 2, td for a sine wave is ¼ of its period, whereas the sampling theorem indicates an interval twice as long. This indicates that if the sine were to degenerate into a noisy signal, shorter sampling intervals would be needed to capture the remaining information.
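The time-delay idea can be illustrated numerically. The sketch below is a hypothetical example, not the authors’ method: it estimates td for a pure sine wave as the first lag at which the signal’s autocorrelation crosses zero, which lands very close to a quarter of the period, consistent with Figure 2:

```python
import numpy as np

period = 100.0                 # arbitrary period, in samples
t = np.arange(0, 10 * period)  # ten full cycles
signal = np.sin(2 * np.pi * t / period)

def first_zero_crossing_lag(x):
    """Return the first lag at which the autocorrelation of x
    changes sign (approximately where it reaches zero)."""
    x = x - x.mean()
    n = len(x)
    for lag in range(1, n // 2):
        r = np.dot(x[:-lag], x[lag:]) / (n - lag)
        if r <= 0:
            return lag
    return None

td = first_zero_crossing_lag(signal)
print(td)  # close to period / 4 = 25 samples
```

For a noisy process signal, a common variant of this idea uses the first minimum of the mutual information rather than the autocorrelation zero crossing, since mutual information also captures nonlinear dependence.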
We are reminded of wastewater plant operators who take a single chlorine effluent sample per shift, only to find regular fecal coliform permit violations. One can see that choosing a spot sampling frequency, which appears simple on the surface, is actually quite challenging when the goal is to detect the behavior of a water quality time series. Further, one can readily see the importance of sampling frequency in helping to maintain water quality.