Author(s): Mike Heath, Sudeep Sarkar, Thomas Sanocki, Kevin Bowyer
Because ground truth is difficult to obtain for real images, the traditional technique for comparing low-level vision algorithms is to present image results side by side and let the reader judge their quality subjectively. This is not a scientifically satisfactory strategy. However, human rating experiments can be conducted more rigorously and can yield useful quantitative conclusions. We present a paradigm, based on experimental psychology and statistics, in which humans rate the output of low-level vision algorithms. We demonstrate the proposed experimental strategy by comparing four well-known edge detectors: Canny, Nalwa–Binford, Sarkar–Boyer, and Sobel. We answer the following questions: Is there a statistically significant difference in edge detector outputs, as perceived by humans, when the outputs are judged for an object recognition task? Do the edge detection results of an operator vary significantly with the choice of its parameters? For each detector, is it possible to choose a single set of optimal parameters for all images without significantly degrading the edge output quality? Does an edge detector produce edges of the same quality for all images, or does the edge quality vary with the image?
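The first question asks whether human ratings differ significantly between detectors. As a minimal sketch of how such a question can be tested, the code below computes a one-way, between-groups ANOVA F statistic, assuming each detector's ratings are pooled into one group. This is a simplified illustration, not the exact analysis used in the study: a real rating experiment would likely also model factors such as image, rater, and parameter setting (for example with a repeated-measures design). The function name `one_way_anova_f` and the example ratings are hypothetical.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way, between-groups ANOVA.

    Hypothetical layout: each inner sequence holds the ratings given to one
    edge detector's outputs.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("need at least two groups")
    if any(len(g) == 0 for g in groups):
        raise ValueError("every group needs at least one rating")
    n_total = sum(len(g) for g in groups)
    if n_total - k < 1:
        raise ValueError("need more ratings than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of the detector means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of individual ratings around their own detector's mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Made-up ratings for three hypothetical detectors, for illustration only.
    ratings = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    f_stat, dfb, dfw = one_way_anova_f(ratings)
    print(f"F = {f_stat}, df = ({dfb}, {dfw})")  # F = 27.0, df = (2, 6)
```

A large F relative to the F distribution with (`df_between`, `df_within`) degrees of freedom suggests that at least one detector's mean rating differs from the others. In practice, a library routine such as SciPy's `scipy.stats.f_oneway` returns both the statistic and its p-value.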