Here is another post in the water quality and water security blog series “Lessons Learned” by John Cook and Edwin Roehl Jr from their research at ADMI. This specific post was penned by Edwin with John serving as co-author.
As was demonstrated in an earlier blog posting, there is considerable “apparent randomness” in the signals used to monitor distribution system water quality. The expression “apparent randomness” is used because much of this behavior can be explained by going back to the water treatment plant and looking at changes in chemical dosing, weather conditions, source water, and so on. To investigate the effectiveness of using upstream monitoring data to account for variability in downstream signals, an analysis was performed using multi-site data from unnamed Utility D. The technical approach used to perform the accounting was artificial neural network (ANN) models having multi-spectral representations of input parameters. Here, the model outputs are water quality (WQ) parameters at a downstream site, and the inputs represent operational parameters “local” to the downstream site, and “boundary condition” operational and WQ parameters measured at an upstream site.
Figure 5 shows Utility D WQ data from an upstream site (SiteD.1) and a downstream site (SiteD.2). The data as obtained has a 1-hour time step; more frequent polling would be ideal, but the data is sufficient to illustrate the concept. The data spans a seven-month period from late summer through winter. Long-term increasing or decreasing trends are apparent, as is generally higher frequency variability at SiteD.1. As indicated in Figure 5, a segment of approximately 42 days (1,000 hours) from within the range of the data was set aside for evaluating modeling results.
The modeling goal was to predict what the values of the SiteD.2 WQ parameters should be at the current time step, based on the current and recent past WQ variability at SiteD.1 and the recent past variability at SiteD.2. The idea is that if consistently accurate predictions can be generated from known causes of variability, an especially inaccurate prediction would signal an event. The modeling procedure and its results are described below.
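As a minimal sketch of this residual-based idea, the Python below flags time steps where the prediction error is much larger than the error seen during a quiet baseline period. The function names, the threshold multiplier k, and the sample values are all hypothetical and are not taken from the Utility D analysis; the original work does not say how an “especially inaccurate” prediction would be judged.

```python
import math

def rmse(errors):
    """Root mean square of a list of prediction errors (residuals)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def flag_events(measured, predicted, baseline_rmse, k=4.0):
    """Return indices where |measured - predicted| exceeds k * baseline_rmse.

    k is an assumed tuning parameter, not a value from the original study.
    """
    threshold = k * baseline_rmse
    return [
        i
        for i, (m, p) in enumerate(zip(measured, predicted))
        if abs(m - p) > threshold
    ]

# Synthetic free chlorine values (made up for illustration only).
measured = [1.00, 1.02, 0.99, 1.01, 0.60, 1.00]
predicted = [1.01, 1.01, 1.00, 1.00, 1.00, 1.01]

# Estimate the "normal" error level from the first four, event-free steps.
baseline = rmse([m - p for m, p in zip(measured[:4], predicted[:4])])
events = flag_events(measured, predicted, baseline)
print(events)
```

In practice the baseline error would come from a period known to be free of events, and the multiplier would be tuned to trade false alarms against missed events.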
ANN “process” models were developed to predict the SiteD.2 COND, CL2 (free chlorine), and pH (PH) using only SiteD.1 TEMP, COND, CL2, and PH. No operational data for these sites was available. For inputs, the upstream parameters were decorrelated and then decomposed into different spectral components, representing variability on time scales from hourly to daily. Figure 6 shows the measured and predicted downstream WQ time series, with the models’ prediction errors, also called “residuals”. Figure 7 details the test data results. All three models generally predict the up and down movement of 24-hour cycling and longer-term variability; however, higher frequency downward spiking at SiteD.2 is not predicted because it is not manifest in the SiteD.1 data. The spiking must therefore have originated downstream of SiteD.1.
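The post does not say how the spectral decomposition was carried out. One simple way to picture it, offered only as an assumption, is to split an hourly series into bands using trailing moving averages of different window lengths, so that each band carries variability at a different time scale and the bands sum back to the original signal. The window lengths (3 and 24 hours), function names, and synthetic series below are hypothetical, and the decorrelation step is not shown.

```python
import math

def trailing_mean(series, window):
    """Mean of the current value and up to window-1 preceding values."""
    out = []
    running = 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            running -= series[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

def split_bands(series, windows=(3, 24)):
    """Split a series into len(windows) + 1 components that sum to the original.

    The first band holds the fastest variability (series minus its shortest
    moving average); the last band is the longest moving average itself.
    """
    bands = []
    previous = list(series)
    for w in windows:
        smooth = trailing_mean(series, w)
        bands.append([a - b for a, b in zip(previous, smooth)])
        previous = smooth
    bands.append(previous)
    return bands

# Synthetic hourly signal: a 24-hour cycle on top of a slow upward trend.
hours = range(72)
signal = [1.0 + 0.002 * h + 0.1 * math.sin(2 * math.pi * h / 24) for h in hours]

bands = split_bands(signal)
rebuilt = [sum(values) for values in zip(*bands)]
```

Each band could then be offered to a model as a separate input, which is one reading of “multi-spectral representations of input parameters”.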
A number of prototype model input configurations were compared for predicting the SiteD.2 WQ 1-time-step moving window differences (MWD = current value – value at “win(dow)” time units ago), called “D1”. In Table 1, the variability of the SiteD.2 D1s is characterized by their RMSs (root mean square = square root of the mean of a parameter’s squared values). Note that the RMSs are two to three times higher for the training data than for the test data. In addition to the dimensionless R2 (coefficient of determination), the residuals can be characterized by the RMSE (root mean square error = the RMS of the prediction errors, or residuals), which is effectively an average error that has the same units as the output parameter. The residuals represent variability in the output that is not explained by the inputs, which here also represent a reduced form of the downstream D1s.
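The definitions above translate directly into a few lines of Python. This is a sketch with made-up pH values, not the Utility D data, and the function names are illustrative. R2 is computed here as 1 minus the ratio of residual to total sum of squares, which is an assumption, since the post does not give its formula.

```python
import math

def moving_window_difference(series, win=1):
    """MWD: current value minus the value `win` time steps ago (win=1 gives D1)."""
    return [series[i] - series[i - win] for i in range(win, len(series))]

def rms(values):
    """Root mean square: square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def rmse(measured, predicted):
    """RMSE: the RMS of the residuals, in the same units as the output."""
    return rms([m - p for m, p in zip(measured, predicted)])

def r_squared(measured, predicted):
    """Assumed definition: 1 - SS_residual / SS_total."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Made-up hourly pH readings and hypothetical model predictions of their D1s.
ph = [7.0, 7.5, 7.0, 8.0]
d1_measured = moving_window_difference(ph, win=1)
d1_predicted = [0.5, -0.5, 0.5]

print(rms(d1_measured))                      # variability of the measured D1s
print(rmse(d1_measured, d1_predicted))       # average size of the residuals
print(r_squared(d1_measured, d1_predicted))  # fraction of variability explained
```

Comparing the RMSE with the RMS of the measured D1s gives a quick sense of how much of the step-to-step variability a model leaves unexplained.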
The R2s in Table 1 indicate that 70-90% of the variability of the SiteD.2 D1s is accounted for by the final models. Figure 8 compares measured and test D1s with the final models’ residuals during the spikes. It shows that the spikes are significantly reduced in the final models’ residuals. Also shown are the residuals of the less effective prototypes that used only SiteD.1 inputs.
Table 1. Utility D final model statistics. “trn” and “tst” subscripts denote statistics related to the training and test data respectively.