Hi Water Quality and Water Security Followers,

We have another series dealing with Water Quality that will feature Eyal Brill, PhD. He is the Chief Developer of the BlueBox Intelligent Event Detection System from Whitewater’s Quality and Security Division and Professor at Holon Institute of Technology. He will be providing us with his insights on statistics and modeling techniques used by the water industry in order to detect water quality events. This will be a bi-monthly series. I hope you all enjoy.

**Background:**

A water quality event is a situation in which the quality of water prevents safe distribution for drinking water purposes. When such a situation occurs a few rules of thumb are of keen interest:

- It is highly important to provide early warning of such events.
- Ideally, we would like to predict an event before it happens (e.g., in the case of a pipe burst).
- Event types should be classified as organic, inorganic, or a water source change.
- The spatial effect of the event should be evaluated in order to assist decision makers in managing the situation.

…That being said,

Water quality measurements are often assumed to have a “normal distribution”. The term “normal” refers to a situation in which about 99.7% of the samples fall within +/-3 standard deviations of the average. Is this true?

Let’s examine some data from the real world. Figure 1 shows the turbidity distribution from an actual customer site. The distribution is based on 40,000 samples (polling interval of 1 minute) taken over a period of several months.

The horizontal axis shows the turbidity measurement. The vertical axis shows frequency. The red line is a continuous fit for a normal distribution. The green line is a continuous fit for a gamma distribution. Even an untrained eye can see that the green line fits the observed distribution better than the red line. So what is the reason for this?

A normal distribution is symmetric: values equally far above and below the average are equally likely to occur. In our sample the average is 0.3 with a standard deviation of 0.26. Thus, under a normal distribution, values such as -0.1 should also occur. However, we know that a turbidity of -0.1 does not exist!

Moreover, under a normal distribution more than 99% of the samples should fall below 1.08 (0.3 plus three standard deviations). However, we know from the actual data that more than 1% of the samples have a value greater than that!
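To put rough numbers on the two objections above, here is a minimal sketch that evaluates the normal model using only the average (0.3) and standard deviation (0.26) quoted in the article. The helper name `normal_cdf` is ours, not something from the original analysis. Under these assumptions the normal model assigns on the order of one in eight readings to impossible negative turbidity, while expecting only a very small fraction above the three-sigma limit.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Summary statistics quoted in the article (turbidity, same units as Figure 1).
MEAN, SD = 0.3, 0.26

# Share of readings the normal model places below zero (physically impossible).
p_negative = normal_cdf(0.0, MEAN, SD)

# Three-sigma upper limit and the share the normal model expects above it.
upper_limit = MEAN + 3 * SD
p_above_limit = 1.0 - normal_cdf(upper_limit, MEAN, SD)

print(f"P(turbidity < 0) under the normal model: {p_negative:.3f}")
print(f"Three-sigma upper limit:                 {upper_limit:.2f}")
print(f"P(turbidity > limit) under normal model: {p_above_limit:.5f}")
```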

The gamma distribution, on the other hand, fits the data better. It is a member of a family of distributions called “long tail” distributions. This family fits situations in which some rare members may lie 5, 10, or even 20 standard deviations from the average.
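As a rough illustration of the long tail, the sketch below draws samples from a gamma distribution whose average and standard deviation match the figures quoted in the article. The parameters come from simple moment matching, which is our assumption; the article does not say how its green curve was fitted, and the customer data itself is not available here.

```python
import random

# Summary statistics quoted in the article.
MEAN, SD = 0.3, 0.26
LIMIT = MEAN + 3 * SD  # the three-sigma limit discussed above

# Moment matching (an assumption, not the article's fitting method):
# gamma mean = shape * scale, gamma variance = shape * scale**2.
shape = (MEAN / SD) ** 2
scale = SD ** 2 / MEAN

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 200_000
samples = [rng.gammavariate(shape, scale) for _ in range(N)]

frac_negative = sum(s < 0 for s in samples) / N
frac_above = sum(s > LIMIT for s in samples) / N

print(f"shape = {shape:.3f}, scale = {scale:.3f}")
print(f"fraction below zero:       {frac_negative:.4f}")
print(f"fraction above {LIMIT:.2f}:       {frac_above:.4f}")
```

With these assumed parameters the simulated gamma samples never go negative, and the share above the three-sigma limit is roughly ten times what a normal model would predict.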

The most common example of such a distribution is income distribution. Assuming that you earn the average income, haven’t you seen people with twenty times your income?!

Conclusion: in many cases water quality measurements do not have a “normal distribution”. Their analysis should be adjusted accordingly. Otherwise, valid samples will mistakenly be classified as abnormal.

Thanks to Dr. Brill for his fine article. Of course, we find that many variables monitored either in the environmental sciences or in continuous flow processes such as the turbidity example simply do not have normal distributions. So looking for outliers based upon so many standard deviations can be very misleading.

Well, in my experience of WWTP effluent data, in periods of “normal, smooth” operation, including natural phenomena like rain events, WQ data are either normal or log-normal depending on the variable. But if you look at longer time series (years) the tails definitely do not fit the distribution, and I am convinced that this is due to the “human factor” (mistakes, sudden changes in operation…) or equipment reliability issues (again human anyway…). See e.g. the WERF LOT study.

Comment Threads from Across LinkedIn:

Water Environment Federation:

Carl McMorran • This is false! I’ve spent years analyzing a wide variety of water quality data and have yet to observe any data set that fits a normal distribution. As a result, I’ve abandoned using any parametric statistics, except where required by regulations. Non-parametric statistics (median, percentiles, etc.) have much more descriptive power for evaluating water quality data.
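A minimal sketch of the comparison Carl describes, using Python's standard `statistics` module. The readings are hypothetical values made up for illustration; they are not data from the article or from any commenter.

```python
import statistics

# Hypothetical skewed turbidity readings (NTU), for illustration only.
readings = [0.08, 0.10, 0.11, 0.12, 0.12, 0.14,
            0.15, 0.18, 0.22, 0.30, 0.45, 2.60]

# Parametric summary: pulled around by the single large reading.
mean = statistics.mean(readings)
sd = statistics.stdev(readings)

# Non-parametric summary: median and 95th percentile.
median = statistics.median(readings)
cut_points = statistics.quantiles(readings, n=20, method="inclusive")
p95 = cut_points[18]  # 19 cut points at 5%, 10%, ..., 95%

print(f"mean = {mean:.3f}, sd = {sd:.3f}, mean - 3*sd = {mean - 3 * sd:.3f}")
print(f"median = {median:.3f}, 95th percentile = {p95:.3f}")
```

On this made-up sample the three-sigma lower bound is negative, which is meaningless for turbidity, while the median and percentiles stay within the range of the observed values.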

——————————————————————————————————

Curt Craigmile • At best, most water quality data sets need to be manipulated to try to force them into a normal distribution. One way to do this is to look at “log normal” distributions, or other mechanisms to change how the data looks. A fundamental problem with this approach is that you are working with numbers that don’t really have that much meaning. For instance, there is good science available linking ammonia concentrations within a stream to the stream’s health, but those correlations are blurred when you look at a data set of log ammonia concentrations.

In an ideal world, the conversions from the raw data set to the “normalized” data set and back again should be seamless, but you will find that even after manipulating data there are enough outliers to provide the statistical analysis with deviations that are greater than you may be comfortable with. These deviations can become magnified by the conversions, which can make the whole process problematic.
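One concrete way the round trip is not seamless: averaging the logs and converting back gives the geometric mean, which is not the arithmetic mean of the raw concentrations. The sketch below shows this with hypothetical ammonia values that are made up for illustration.

```python
import math
import statistics

# Hypothetical ammonia concentrations (mg/L), for illustration only.
ammonia = [0.2, 0.3, 0.3, 0.5, 0.8, 1.2, 4.0]

# Work in log space, then convert the average back to mg/L.
logs = [math.log(c) for c in ammonia]
back_transformed_mean = math.exp(statistics.mean(logs))

# The ordinary average of the raw concentrations.
arithmetic_mean = statistics.mean(ammonia)

print(f"arithmetic mean of raw data:    {arithmetic_mean:.3f} mg/L")
print(f"back-transformed mean of logs:  {back_transformed_mean:.3f} mg/L")
```

The back-transformed figure is always at or below the arithmetic mean, so anyone comparing it against a limit written in raw concentration units needs to know which of the two they are looking at.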

As the industry moves to more stringent treatment requirements, I believe that it makes sense to recognize these issues and develop standards accordingly. Establishing a standard of having to treat down to 1.0 mg/l of a certain pollutant 95% of the time may be much more cost effective and offer better protection of the environment than a standard that requires 1.5 mg/l of that pollutant 100% of the time.

———————————————————————————————————–

John B. Cook, PE • @Curt and Carl,

Indeed, most continuous flow process monitoring reveals data which fits a log-normal rather than a normal distribution, as does ambient water quality monitoring. That is Dr. Brill’s thesis, and one that is in accord with my observations as well.

Regarding the concept of limits set as a percentile, that is completely consistent with the idea of log normal distributions and a more reasonable way to proceed for permit limitations. That is also the structure of regulations associated with the Lead and Copper Rule and, of course, Dr. Brill’s example of turbidity, in which 0.3 NTU must be met 95% of the time. There is also an absolute limit of 5 NTU for turbidity. This is applicable in the US and I’m afraid I’m not sufficiently versed in what Europe is doing. The point is, that in some cases, this approach is being taken already, but certainly not in all cases.
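A minimal sketch of a percentile-style limit paired with an absolute cap, using the two figures mentioned above (0.3 NTU met 95% of the time, 5 NTU never exceeded). The function name and its parameters are hypothetical, and this is a simplification for illustration; it does not reproduce how any regulation actually defines its sampling periods or calculations.

```python
def meets_turbidity_limits(readings, percentile_limit=0.3,
                           required_fraction=0.95, absolute_limit=5.0):
    """Return True if enough readings are at or below the percentile limit
    and no reading exceeds the absolute limit (simplified illustration)."""
    if not readings:
        raise ValueError("at least one reading is required")
    within = sum(r <= percentile_limit for r in readings)
    fraction_within = within / len(readings)
    return fraction_within >= required_fraction and max(readings) <= absolute_limit

# Hypothetical reporting periods, for illustration only.
mostly_clean = [0.1] * 24 + [1.0]    # 96% within 0.3 NTU, peak of 1.0
too_many_high = [0.1] * 18 + [1.0] * 2   # only 90% within 0.3 NTU
one_big_spike = [0.1] * 99 + [6.0]   # 99% within 0.3 NTU, but a 6.0 spike

print(meets_turbidity_limits(mostly_clean))
print(meets_turbidity_limits(too_many_high))
print(meets_turbidity_limits(one_big_spike))
```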

———————————————————————————————

Eyal Brill • True.

However, most SCADA systems use SPC methods which are based on normal distribution, standard deviation and so on. This is why it is important to raise this issue.

————————————————————————————-

John B. Cook, PE • @Dr. Brill,

The original application for SPC was discrete part manufacturing if memory serves. Hence, normality could be reasonably assumed and autocorrelation of sampling was not an issue. There are ways to modify SPC for continuous flow processes, but the standard assumptions can’t be used. But if modified, it would be an improvement over current practice.
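To illustrate the autocorrelation point, the sketch below compares the lag-1 autocorrelation of independent samples with that of a slowly drifting series. Both series are simulated; the drifting one is a simple first-order autoregressive process chosen by us as a stand-in for a continuous-flow signal, not a model of any real plant.

```python
import random
import statistics

def lag1_autocorrelation(series):
    """Correlation between each sample and the one immediately before it."""
    mean = statistics.fmean(series)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator

rng = random.Random(1)  # fixed seed so the run is repeatable
N = 5000

# Independent samples, closer to the discrete-part manufacturing assumption.
independent = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Simulated drifting signal: each value carries over 90% of the previous one.
PHI = 0.9  # assumed persistence, for illustration only
drifting = [0.0]
for _ in range(N - 1):
    drifting.append(PHI * drifting[-1] + rng.gauss(0.0, 1.0))

print(f"lag-1 autocorrelation, independent: {lag1_autocorrelation(independent):.3f}")
print(f"lag-1 autocorrelation, drifting:    {lag1_autocorrelation(drifting):.3f}")
```

When consecutive samples are this strongly related, control limits that assume independent samples tend to be too narrow, which is one reason the standard assumptions cannot be carried over unchanged.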

————————————————————————————————–

Johnny Qing Sheng Ke • If we measure the value of a parameter with a FIXED true value, the measured results should be normally distributed. However, in wastewater treatment processes, the outcomes of the processes are not FIXED. The outcome varies due to the variations in the incoming flow and loadings. In a process design, the process unit is sized for the design flow and loading, e.g. maximum month average day flow and loading. At any given time, the flow and loading may be anywhere between minimum and maximum, hence the outcomes vary.

Agree with John Cook, SPC is for a well-controlled manufacturing environment.

Every data set should be carefully checked to see whether it follows a normal distribution. If it does not, a transformation of the data set (like taking the log) can be used to make it normal or close to normal. (And sometimes none of the possible transformations can make the data set fit a normal distribution.) The data analysis should then be conducted on the transformed data to generate the statistics. I just analyzed 2 months’ daily UVT data from an MBR filter for a UV reactor design, and the raw data set fits a normal distribution.
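A quick, informal way to do the check described above is to compare the skewness of the raw data with the skewness of its logs. The sketch below uses simulated log-normal data with parameters we picked for illustration; it is a screening aid under those assumptions, not a formal normality test.

```python
import math
import random
import statistics

def sample_skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(2)  # fixed seed so the run is repeatable

# Simulated right-skewed measurements (assumed parameters, illustration only).
raw = [rng.lognormvariate(-1.5, 0.7) for _ in range(5000)]
logged = [math.log(v) for v in raw]

print(f"skewness of raw data:    {sample_skewness(raw):.2f}")
print(f"skewness of logged data: {sample_skewness(logged):.2f}")
```

If the raw skewness is clearly positive and the logged skewness is close to zero, a log transform is a reasonable candidate. If both are far from zero, this may be one of the cases where no simple transformation helps.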

——————————————————————————————————————-

@Johnny,

Thanks for your post. Yes; the world is not a neat and tidy place. Processes rarely work in steady state, unless highly controlled. This is not the typical case with water or WWTPs with fluctuations in the flow and quality required to be treated. The purpose behind knowing a distribution is, among others, to determine the extent of outliers and how to change the process to minimize outliers and bring the distribution closer to the y-axis in the case of turbidity; in the case of another variable that is normally-distributed, it might mean trying to minimize variability around the mean. We see this idea in the quality movement–6 Sigma, etc. Of course, I am stating the obvious, but perhaps it’s not obvious to everyone.

———————————————————————————————————-

Sewage and Wastewater Professionals

William Randall • In the wastewater work I have done, I have evaluated numerous data sets of influent and effluent looking at that very question. There are a number of situations where the normal distribution is an adequate model for wastewater characteristics. For effluent quality, however, especially for constituents where we continue to push toward lower values, the normal assumption is impossible to justify, yet it is the assumption behind most regulation. Something more chi-squared in appearance, with a high peak near the low values (zero) and a long tail, has been shown to be more representative in some of the data I have assessed. The consequences are enormous if regulations are based on maximum values or simple averages. The same issues emerge in water quality as we look at microconstituents: there is a limit at the low end (zero) but an “unbounded” upper value. The requirements to reliably meet ppb and lower levels must be based on statistics that are more descriptive than simple averages if they are to be realistic scientifically or economically.

———————————————————————————————————-

ISA Water & Wastewater Industries Division

Jake Brodsky • First, you’re dealing with the bottom edge of an instrument’s scale. Get a particle counter instead of a turbidity meter. Your math is the result of an instrumentation artifact.

Second, I think something like this ought to be displayed on a logarithmic scale.

Third, turbidity depends upon lots of natural events. Asserting that it has a normal distribution presumes that the weather and climate itself has a normal distribution.

Fourth, you’re looking at just shy of a month’s worth of data. Weather can change in that period of time. Why not study what it looks like over an entire year and then get back to us?

——————————————————————————————————–

Alex Ekster Ph.D.,P.E.,B.C.E.E. • Influent and effluent parameters usually follow a log-normal distribution; state parameters can be normally distributed. For parameters that do not follow any distribution, non-parametric statistics may apply, using the median instead of the mean.

———————————————————————————————————-

John B. Cook, PE • Jake,

I would concur with Alex’s statement above. An effluent parameter such as finished water turbidity tends, time and again, to yield a log-normal distribution. If the filter performance is very poor, then the distribution could be normal, but this is unusual. I have up-rated filtration systems looking at years of data and hundreds of turbidity histograms, and yes, one sees a log-normal distribution with the exception of poorly-performing filters. Whether or not turbidity should be displayed on a log-scale graph is up for discussion, but the fact that it would look normally distributed if displayed on a log scale makes the case that it’s not normally distributed, doesn’t it?

I certainly agree that weather affects virtually all aspects of an operation, in ways most of which we do not recognize. But, if you refer back to an earlier blog post, you will see that turbidity on a distribution system, irrespective of what is leaving the WTP, is highly variable because of unaccounted-for forcing functions which include, among other things, weather-related forces such as temperature and barometric pressure. In general, weather is both periodic and chaotic, and both signals affect water quality in the source of supply and the treatment process. Weather tends toward periodicity with respect to temperature, but tends to be chaotic otherwise. Think of the early work of Edward Lorenz of MIT on trying to predict the weather. We still can’t accurately predict the weather beyond several days, and that’s with all the sophisticated weather monitoring and satellite data. So your point that weather is an important forcing function which is not optimally considered in process control is a good one. It should be.

————————————————————————————————————-

Jake Brodsky • I think we all realize there are many other parameters to consider with filtration: Detention time, media age, influent turbidity integration versus time since last backwash, filter area per volume of water…

There are so many parameters here that it is almost axiomatic that there would be a normal curve in there somewhere.

Also note that I did state in my second point (though perhaps not as eloquently as Alex) that turbidity should have been plotted on a log scale. That makes my issue of instrument artifacts all the more relevant.

I know we have to use turbidity meters because we have legal documentation to make, but from a purely process oriented perspective, a laser particle counter system offers a great deal more resolution and it may be better suited for identifying the condition of a filter.

———————————————————————————————————

Brian Rutledge • More than likely a log-normal distribution would be a better fit. I think people might gravitate towards normal distributions because they are more likely to know how to do the math.

Think about this: According to Wiki, “A variable might be modeled as log-normal if it can be thought of as the multiplicative product of many independent random variables each of which is positive.” http://en.wikipedia.org/wiki/Log-normal_distribution
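The quoted definition can be tried out with a small simulation: multiply many independent positive factors together and compare the result with adding the same factors. The factor range and counts below are arbitrary choices of ours, for illustration only.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable
TRIALS, FACTORS = 20_000, 20

products, sums = [], []
for _ in range(TRIALS):
    # Independent positive factors between 0.5 and 1.5 (assumed range).
    factors = [rng.uniform(0.5, 1.5) for _ in range(FACTORS)]
    products.append(math.prod(factors))
    sums.append(sum(factors))

# For a symmetric distribution the mean and median nearly coincide;
# a long right tail pulls the mean well above the median.
product_ratio = statistics.fmean(products) / statistics.median(products)
sum_ratio = statistics.fmean(sums) / statistics.median(sums)

print(f"mean / median for products: {product_ratio:.2f}")
print(f"mean / median for sums:     {sum_ratio:.2f}")
```

The products show the skewed, long-tailed shape associated with a log-normal distribution, while the sums of the same factors stay close to symmetric.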

Each event that contributes to a waste flow can be thought of as something that contributes either a positive or zero flow (never a negative) with positive or zero (never negative) solids, chemical analysis, BOD, COD etc. for a certain duration of time. These discrete events added together result in the overall water quality numbers.

Many times a distribution that is the result of “counting” discrete non-negative events can be described using a log-normal distribution. Other examples are particle size distributions of slurry products (counting particles at various sizes), or Elmendorf Tear (which is in effect “counting” bonds each having different strengths).

I look forward to discussion on this topic.

—————————————————————————————————————-

Water Industry Process Control and Automation (WIPAC)

Besancon, Axelle • Hi Noah,

Good question, I’d say it depends on the space/volume and time scale. Over a year for instance, you always have external variations with time such as temperature, seasonal activities or evolution (soil, biomass…) and I think it might be true for both treated wastewater and fresh water. But maybe over 10-20 years of intensive sampling the variation is flattened?

——————————————————————————————————————-

John B. Cook, PE • @Axelle,

We should not at all be surprised to see log-normal or skewed distributions in the natural environment. Think, for example, of microbial sampling in which some very large spikes can occur, particularly following a rain event. The distribution will not be normal. As regards biological wastewater treatment, the majority of values, whether for BOD, TSS or FC, will be on the low side, but as a result of weather conditions and flow variations, it is common to see spikes in each of these. These numbers can never go below zero, but they can surely peak upward, particularly fecal coliform. So, here are yet other examples of non-normal distributions which hold irrespective of time frame. Many other examples could be given.

The exception could in theory be an industrial WWTP in which flows are constant and treatment is based upon chemical removal or nanofiltration which would tend to attenuate peaks.