This article, written by Edwin Roehl, CTO (primary) and John Cook, CEO (secondary) of Advanced Data Mining Intn’l, describes another, more complex case study, Utility A, in which data from upstream monitoring sites are used to account for water quality (WQ) variability at a “target site” that would otherwise “appear” to vary non-deterministically. As before, the aim is to filter out undesirable signals in the hope of exposing important “events”, such as those caused by intentional or unintentional contamination. Previous findings have determined:
- Non-determinism, also known as “apparent randomness”, is prevalent in WQ signals. This largely undermines the goal of reliable event detection by making it impossible to discriminate between “normal” and “abnormal” process behavior.
- Data from upstream sites can provide “boundary conditions” that account for most (more than 80%, as measured by R²) of the variability at a downstream site. This raised the possibility that downstream WQ signals can be filtered to remove all “known” causes of variability, allowing event detection to proceed on filtered signals that manifest only “unknown” causes. The accounting has been performed with multivariate, nonlinear empirical models developed using multi-layer perceptron artificial neural networks (ANNs). The models use multi-spectral representations of input signals with time delays.
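The filtering idea in the second finding can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the function names and the sample conductance values are hypothetical, and the "model predictions" are simply made-up numbers standing in for ANN output. It shows how R² measures the fraction of variability accounted for, and how the filtered signal is the residual left after subtracting the prediction.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: fraction of variance accounted for."""
    mean = sum(measured) / len(measured)
    ss_tot = sum((m - mean) ** 2 for m in measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return 1.0 - ss_res / ss_tot


def filtered_signal(measured, predicted):
    """Residual after removing the modeled ("known") variability."""
    return [m - p for m, p in zip(measured, predicted)]


# Hypothetical downstream readings and stand-in model predictions.
measured = [410.0, 415.0, 422.0, 430.0, 428.0, 419.0]
predicted = [411.0, 414.0, 423.0, 428.0, 429.0, 420.0]

print(round(r_squared(measured, predicted), 3))
print(filtered_signal(measured, predicted))
```

Event detection would then operate on the residual rather than on the raw signal, so that only variability the model could not account for remains to be examined.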
The accounting modeling work previously reported has merely raised the possibility of benefiting event detection. The analyses presented here are more comprehensive and informative on this matter. The results are mixed but promising:
- 4 Sites at Utility A – these 1-minute data represent a monitoring configuration of two booster pump stations and two tanks, as shown in Figure 1. The accounting problem was to predict the WQ at BPS 14 (hereafter BPS14) from local-to-the-site operational parameters, and from operational and WQ parameters at BPS 9 (hereafter BPS9), EST 9 (hereafter TANK9), and EST 14 (hereafter TANK14). The parameters used in this analysis are indicated in Figure 2, where Q = flow, LVL = tank level, PSUC = pump suction pressure, PDIS = pump discharge pressure, COND = specific conductance, CL2 = residual total chlorine concentration, and TEMP = water temperature.
Utility A operates three water treatment plants (WTPs) with different sources, and Figure 1 shows two main junctions among the four study sites; these junctions are sources of unmeasured forcing. The earlier results are shown in Figures 3 and 4. The problem is revisited here to determine whether filtering the WQ signals would actually aid event detection.
UTILITY A SITE ANALYSIS
The 4-site, 1-minute Utility A data spanned approximately one year. They were converted to 4-minute data by applying a moving window average and then sub-sampling, which greatly reduced the turnaround time for generating and evaluating different ANN prototype model configurations. The researchers do not believe that this qualitatively compromised the 4-site re-analysis. Despite much extra effort, the new models turned out to be only slightly better predictors than the previous ones. The raw 1-minute signals had previously been corrected to remove questionable data for earlier analyses. Only the COND and CL2 parameters were analyzed because the other WQ parameters, pH and turbidity, were of questionable quality across the four sites.
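The conversion from 1-minute to 4-minute data can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the averaging window length or its alignment, so a trailing 4-sample window and a 4-sample sub-sampling step are assumed here, and the function names are hypothetical.

```python
def moving_average(signal, window):
    """Trailing moving-window average.

    Returns len(signal) - window + 1 values; element k is the mean of
    signal[k : k + window].
    """
    running = sum(signal[:window])
    out = [running / window]
    for i in range(window, len(signal)):
        running += signal[i] - signal[i - window]
        out.append(running / window)
    return out


def downsample(signal, window=4, step=4):
    """Smooth with a moving average, then keep every `step`-th value.

    window=4 and step=4 are assumed values for 1-minute -> 4-minute data.
    """
    return moving_average(signal, window)[::step]


# Twelve hypothetical 1-minute samples become three 4-minute samples.
one_minute = [float(v) for v in range(1, 13)]
print(downsample(one_minute))
```

Averaging before sub-sampling suppresses variability faster than the new sampling interval, which would otherwise alias into the 4-minute series.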
The general modeling/filtering approach was the same as for the Utility D analysis previously provided in the blog, and was composed of the following:
- Segmenting the 1-year data set into training data and a continuous test data segment. The training data were the last two months of the set.
- Development of process models that predicted the BPS14 COND and CL2 from inputs representing the dynamics of the operational parameters at all four sites (Q, PSUC, PDIS, LVL), and the boundary condition WQ parameters (COND, CL2) at BPS9, TANK9, and TANK14.
- Development of a hybrid process+autoregressive model, akin to the “final model” in the Utility D analysis, to predict the BPS14 COND and CL2 MWDs. The process inputs were calculated from the two process model predictions, and the autoregressive inputs were calculated from the BPS14 COND and CL2 signals, time-delayed so that they fell outside the output MWD calculation windows. Using autoregressive inputs at smaller delays would risk having the predictions simply track the variability caused by a real event, causing the event to go undetected.
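The delay constraint on the autoregressive inputs in the last step can be illustrated with a short sketch. It is not the authors' code: the function name, the 5-sample window, and the lag values are all hypothetical, and the only assumption made about the MWD calculation is that the output at index t is computed from the `window` most recent samples. Every autoregressive lag must then be at least `window` samples so that no input overlaps the data used to compute the output.

```python
def autoregressive_inputs(signal, t, window, lags):
    """Return past values of `signal` usable as inputs for the output at index t.

    Assumes the output at t is computed from samples t-window+1 .. t, so a
    lag smaller than `window` would fall inside the output window and let
    the model track an event instead of exposing it.
    """
    for lag in lags:
        if lag < window:
            raise ValueError(
                f"lag {lag} falls inside the {window}-sample output window"
            )
        if t - lag < 0:
            raise ValueError(f"lag {lag} reaches before the start of the signal")
    return [signal[t - lag] for lag in lags]


# Hypothetical 4-minute signal; a 5-sample window would span 20 minutes.
signal = list(range(20))
print(autoregressive_inputs(signal, t=15, window=5, lags=[5, 8, 10]))
```

Rejecting short lags outright, rather than silently dropping them, makes a misconfigured input set fail early during model prototyping.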
Figures 4 and 5 show the process model predictions and performance statistics of BPS14 COND for the training and test data respectively. Figures 6 and 7 show the same for the BPS14 CL2. Additional statistics are provided in Table 1, where the COND and CL2 process models are #1 and #7. Note that the latter two-thirds of the testing COND and CL2 data (Figures 5 and 7) exhibit long-term “hump” and “trough” trends respectively. These are possibly a consequence of operational changes that route water to the area from different WTPs, one of whose sources is more saline than the other two.
Table 1. Utility A ANN model statistics. “trn” and “tst” subscripts denote statistics related to the training and test data respectively; “MWD win” is the window size used to compute outputs that are MWDs; “# INPUTS” is the number of signal components used (after culling); “HLN” is the number of hidden layer neurons used, and is a measure of fit articulation; and “N” is the number of input-output vectors.
The Utility A site analysis indicates that the multi-site accounting approach is less likely to be useful as the array of monitoring sites becomes more complex. The Utility A configuration comprised four monitoring sites as well as at least two unmonitored junctions (Figure 1). The analysis also indicates that accounting for high-frequency WQ variability, which is necessary for detecting events within the 20-minute target window, is more challenging than accounting for lower-frequency variability. This suggests that monitoring sites and locations might need to be specifically engineered to better support the event detection goal.