Here is the second water quality monitoring post by guest blogger Eyal Brill, PhD, who is the Chief Scientist and Developer of the BlueBox Intelligent Event Detection System for Whitewater’s Quality and Security Division, and a Professor at the Holon Institute of Technology in Israel.
Models for water quality event detection are mainly statistical models. They try to capture historical trends and events in order to detect if something abnormal is occurring. The life cycle of such a model includes a constant inspection of its results in order to be able to answer a simple question: How do I know that my model is good and when it is time to recalibrate?
A common approach to this issue is to use the ROC curve. The ROC curve plots true positives against false positives. In other words, we examine how many false alarms the model triggers relative to the number of TRUE alarms it fires in a given period of time. Figure 1 explains this approach.
The horizontal axis counts the false positives. The vertical axis counts the TRUE alarms. The blue curve shows the ratio between the two. It can be seen that in order to detect 80 true alarms the system generates about 10 false alarms. This is a ratio of 1 to 8. The more curved the blue line is, the better the detecting system.
Question: Is a model with an infinite ROC ratio a good one? In terms of figure 1, if the blue line climbed directly from zero to 100, would it make us happy?
Answer: Always NO!!
Why? The answer to that is based on two (2) reasons.
The first is related to the fact that the weight of a FALSE POSITIVE and a FALSE NEGATIVE should not be equal. 9 out of 10 managers of a water utility would say that they can tolerate false positives to some degree, but they have zero tolerance for a false negative (i.e. they can’t accept a situation in which customers received contaminated water because the detection system failed to catch it). A false positive may cause a loss of money and organizational irritation, but a false negative may lead to loss of life!
Conclusion: We need some way to weight a false positive and a false negative differently.
But this is not the end of the story. Imagine two systems that have exactly the same ROC. The first one achieves this result on a dataset of 10,000 records, while the second one has the same ratio on a dataset of 1,000,000 records. Clearly, the second one is better. Why? The more records you process, the higher the chances are for mistakes. If a system processes a hundred times more records and the number of mistakes remains constant, it is better.
Question: Is there a statistical method that covers all of these scenarios? In other words, is there a measurement that takes into account the True positive, False positive, False negative and True negative?
Answer: Yes. Cohen’s Kappa.
Assume that we ask two experts to express their opinion about a group of individuals who apply for a job. Some of the candidates will be rejected by only expert A or only expert B, some will be rejected by both, and some will be accepted by both. Kappa measures how much of the agreement between the two experts goes beyond what would be expected by chance. Kappa ranges from -1 to 1. A value of 1 means that the two experts completely agree. A value of -1 means that the two experts completely disagree: any candidate that was accepted by expert A was rejected by expert B and vice versa. A value of zero (0) means that the agreement between the two experts is no better than random.
…Now, how is this related to our case? Table 1 below explains.
The two experts in our example above are the Model and the actual data. The middle cells in table 1 show all four possible options between the two “Experts”.
The value of Kappa is calculated by the following formula:
K = (P – Pe)/(1-Pe)
Where P is the proportion of units where there is agreement, and Pe the proportion of units which would be expected to agree by chance.
The following numerical example will make it clear. Table 2 gives the performance of a model.
For example 61 times the model said that the water is not ok and rightly so, as the actual data revealed the water was not ok.
The Pe value is calculated from the products of the matching row and column totals: [ (63×67)/94 + (31×27)/94 ]/94 = 0.572
The P value is calculated by (61 + 25)/94 = 0.915
And thus the Kappa value will be: (0.915 – 0.572)/(1-0.572) = 0.801
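The same arithmetic can be sketched in a few lines of Python. The function name and argument names below are illustrative, not part of any product. The two disagreement cells (2 and 6) are inferred from the totals used above (63, 67, 31, 27), and which of them is the false positive and which the false negative is an assumption; Kappa comes out the same either way.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of model verdicts vs. actual data."""
    n = tp + fp + fn + tn
    # P: proportion of units where model and actual data agree
    p = (tp + tn) / n
    # Pe: agreement expected by chance, from the totals of each "expert"
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (p - pe) / (1 - pe)

# 61 = both say "not ok", 25 = both say "ok";
# 2 and 6 are the disagreement cells (assignment assumed).
kappa = cohen_kappa(tp=61, fp=2, fn=6, tn=25)
print(round(kappa, 3))  # 0.801
```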
A common rule of thumb ranks kappa values above 0.8 as EXCELLENT, values between 0.6 and 0.8 as GOOD, values between 0.4 and 0.6 as FAIR and values below 0.4 as POOR.
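The earlier point about dataset size can also be checked with the Kappa formula. The sketch below keeps the number of alarms and mistakes fixed and only grows the number of correctly ignored records. The counts of 80 true alarms and 10 false alarms echo figure 1; the 5 missed events and the function name are made up for illustration.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of model verdicts vs. actual data."""
    n = tp + fp + fn + tn
    p = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (p - pe) / (1 - pe)

# Same alarms and same mistakes in both systems (hypothetical counts).
tp, fp, fn = 80, 10, 5

# Whatever remains of each dataset is true negatives.
k_small = cohen_kappa(tp, fp, fn, tn=10_000 - (tp + fp + fn))
k_large = cohen_kappa(tp, fp, fn, tn=1_000_000 - (tp + fp + fn))

# The system that handled more records with the same mistakes scores higher.
print(k_small < k_large)  # True
```

With events this rare the gain is small, but it moves in the direction described: the same number of mistakes over more records gives a higher Kappa.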
Two final remarks.
Question 1: What do we put in the cells of table 1?
Answer: Days, hours, events.
Question 2: How do we weigh a false negative more?
Answer: We multiply its cell by a weighting factor.
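As a rough sketch of that idea, the false negative cell can be scaled before Kappa is computed. The factor of 5 is an arbitrary illustration, the function and parameter names are hypothetical, and the cell counts reuse the numerical example above with the same assumption about which disagreement cell is the false negative.

```python
def weighted_kappa(tp, fp, fn, tn, fn_weight=1.0):
    """Cohen's kappa after multiplying the false negative cell by a factor."""
    fn = fn * fn_weight  # penalize missed events more heavily
    n = tp + fp + fn + tn
    p = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (p - pe) / (1 - pe)

plain = weighted_kappa(61, 2, 6, 25)                  # no extra weight
strict = weighted_kappa(61, 2, 6, 25, fn_weight=5.0)  # assumed factor

# The same model looks worse once misses count five times as much.
print(strict < plain)  # True
```

How large the factor should be is a judgment for the utility, reflecting how much worse a missed event is than a false alarm.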
If you’re interested in learning more about the BlueBox Intelligent Event Detection System, you can download the datasheet from the Box.Net media kit application in the right side tool bar of the blog.
For more background about this method, or about Cohen’s kappa for cases with more than two options, just Google it and you will find plenty of material.