Quantification of Heat Map Data Displays for High-Throughput Analysis

Heat maps have been used as a means to visualize high-density information in settings as diverse as astronomy, business analysis, and meteorology. Discovery biology research teams have also used heat maps to visualize gene clusters in genomics investigations or to study amino acid distribution in protein sequence analysis. Commercially available software packages, like Spotfire® or SAS JMP®, afford scientific investigators the ability to construct heat maps and visualize information from studies, yet do not offer any form of summary statistic that would be useful in high-throughput investigations comparing the results of a large number of data visualizations simultaneously or viewing changes in the display longitudinally (over time). Previously, Juneau suggested the use of Plotnick's characterization of lacunarity (1996) for two-dimensional heat map data displays in two colors or shades. For c (c>2) discrete shades (in a monochromatic map) or hues (in a full color display), the author will suggest a modification to Plotnick's approach using the underlying gliding box approach developed by Allain and Cloitre, but with an alteration in the means of counting features.


Introduction
Background
Heat maps have been employed as a data visualization tool in the social sciences since the late nineteenth century [1][2][3], and more recently in arenas as diverse as astronomy [4][5][6], business analysis [7][8][9], meteorology [10][11][12] and quantum mechanics [13]. Changes in the ambient conditions of complicated systems like galaxies, the stock market, or large meteorological phenomena can be readily displayed via differences in color or grayscale to assist investigators in hypothesis-generating or data-interpretation activities. Heat maps afford investigators the ability to study high-density data sets in a single visualization, while maintaining measurement relationships and data integrity [14][15][16].
The popularity of heat maps has grown substantially with developments in the field of bioinformatics [17]. Numerous examples of published methods employing heat maps exist [18][19][20][21][22][23][24][25][26][27][28][29][30][31]; however, the author has not, to date, seen an attempt to represent the information present in these very widely used data visualizations with a single summary statistic. Commercially available software packages, like Spotfire® or SAS JMP®, afford scientific investigators the ability to construct heat maps and visualize information from studies, yet do not offer any form of summary statistic that would be useful in high-throughput investigations comparing the results of a large number of data visualizations simultaneously or viewing changes in the display longitudinally (over time). Previously, Weinstein (1997) had suggested using a "difference heat map" to compare visualizations pre- and post-treatment. This approach seems tenable if only two time points are under consideration, but would result in several pre-post difference heat maps if one were interested in comparing change from baseline over time or all pair-wise comparisons of time responses.
One approach to numerically summarizing the content of a heat map might be to use a statistic like the percentage of the entire map colored or shaded by a specific hue or degree of brightness. One could summarize each heat map by the percentage of tiles of a given color relative to the whole. The use of such a statistic would not differentiate the apparent geometry of the heat map depicted in Figure 1.1.1a from that of the other two. The geometries of Figures 1.1.1b and 1.1.1c might suggest an underlying block structure relationship between rows and columns, which possibly could be related to an underlying multivariate mechanism with a block diagonal covariance matrix [32]. A black diagonal band bisects the block structure. Thus, a percentage approach does not account for the overall pattern presented in a heat map and would therefore not serve as a useful numerical summary of the relationships suggested.
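The shortcoming of the percentage summary can be illustrated with a minimal sketch. The two 4x4 two-shade maps below are hypothetical (they are not the maps of Figure 1.1.1), and the function name shade_fractions is introduced here only for illustration. Both maps are half black and half gray, so the percentage statistic cannot tell them apart even though one is blocked and the other is a checkerboard.

```python
import numpy as np

def shade_fractions(heat_map):
    """Return {shade: fraction of tiles carrying that shade}."""
    values, counts = np.unique(heat_map, return_counts=True)
    return {int(v): float(c) / heat_map.size for v, c in zip(values, counts)}

# Hypothetical two-shade maps (1 = black, 0 = gray); not taken from the paper.
blocked = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]])
checker = np.indices((4, 4)).sum(axis=0) % 2  # alternating 0/1 pattern

print(shade_fractions(blocked))
print(shade_fractions(checker))
```

Both calls report the same fractions, which is exactly why a summary sensitive to spatial arrangement is needed.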
A second approach could be to characterize the space-filling geometry of the tiles in the heat map via a Hausdorff-like dimension [1]. Consider a non-empty subset of ℝ^n, say S, which may be covered with sets of diameter at most µ (µ>0), such that the diameter of the covering sets is normalized to S (i.e., the size of S is considered unity). The Hausdorff dimension is calculated in mathematics as a limiting procedure related to the logarithm of the number of sets of diameter at most µ that cover S as the diameter of these sets approaches zero [33]. In practice, the size of the features that form a pattern might be of interest, as well as the patterns themselves. Thus, use of the Hausdorff dimension in a strict mathematical sense might not be practical because of the investigator's desire to consider features at least as large as µ (µ>0).
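For reference, the limiting procedure described above is usually written in the box-counting form below. Strictly speaking this is the box-counting dimension, which agrees with the Hausdorff dimension for many, but not all, sets. The symbol N(µ), the number of sets of diameter at most µ needed to cover S, is notation assumed here for illustration; the fixed-scale version on the right is one plausible reading of a "Hausdorff-like" dimension evaluated at a single µ.

```latex
D = \lim_{\mu \to 0} \frac{\log N(\mu)}{\log (1/\mu)},
\qquad
D(\mu) \approx \frac{\log N(\mu)}{\log (1/\mu)} \quad \text{for a fixed } \mu > 0 .
```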
Consider the bi-colored heat map displayed in the accompanying figure. From the covering it is easy to calculate a Hausdorff dimension for µ=0.1. The major shortcoming of this approach is that, as was the case with the percentage summary, the geometry of two heat maps may be markedly different while their Hausdorff dimensions are identical. An example of this phenomenon is illustrated in Figure 1. This limitation of such dimension measurements was first recognized by Mandelbrot [34] in the context of numerically summarizing fractals. Mandelbrot advocated the use of a quantity called the lacunarity to numerically summarize fractals. The root of the word lacunarity is the Latin lacuna, meaning "space" or "hole". Thus, as the Hausdorff dimension characterizes the "space-filling" properties of a set, the lacunarity measures the presence of gaps or holes in the set.
Juneau [1] suggested the use of lacunarity to characterize heat maps in two colors or shades, based upon a method suggested by Plotnick [35]. Plotnick's method was based upon a more general case originally developed by Allain and Cloitre [2]. This approach provides an investigator with a measure of the "gappiness" of one shade or color relative to a second. The balance of this paper is based upon a suggested method for a setting with c colors or shades, for c>2. Section 1.2 will summarize the gliding box approach of Allain and Cloitre and illustrate its behavior for a heat map in two colors or shades. Section 2 will introduce a proposal for a form of modified lacunarity for the setting of c colors or shades (c>2) and highlight the scaling feature of the approach. Section 3 will provide examples of three applications of the technique: cluster analysis, the summarization of longitudinal data in meteorology, and genomics.

The gliding box approach of Allain and Cloitre and the calculation of a heat map's lacunarity for heat maps in two colors or shades
For some heat map, H ⊂ ℝ², let P represent a board [36] that partitions H: (1) define p to be a polygon with s sides and diameter(p) = ρ, where diameter(·) = maximum of the lengths of the s sides of p; (3) diameter(p_i) = diameter(p_j) ∀ i, j ∈ Z⁺. Without loss of generality, p can be defined to be a rectangle. A subset p ⊂ P can be called a feature of the heat map. Define a box, B ⊂ P, to consist of a set of contiguous features, p, whose union is similar to the polygon, p.
The scores from the T movements of B across H (m = 1 to T) may now be tallied to form a discrete probability mass function over all possible values from 0 to k² for a box of diameter k. Call this probability mass function Ψ. For two given sets, the one with the larger value of Γ will be a set of gray features more diffusely distributed throughout; i.e., the occurrence of black regions will be more frequent and their size relatively larger. In the second row, the scores would be 2, 3, 2, 3 and 2. Thus, if one were to proceed moving the gliding box over the entire heat map for the remaining three rows:

Development of a modified lacunarity using the Allain and Cloitre approach
Consider the heat map, H_c, depicted in (those that are not gray). The goal is to develop a form of measurement that simultaneously summarizes the clustering of colors or shades for feature subsets of H_c relative to the other colors or shades.
One possible approach to developing a modified Allain and Cloitre procedure for multi-shaded or multi-colored images or heat maps would be to count the number of neighboring discordant pairs within B as it traverses H_c. In an intuitive sense, studying the distributional properties of the discordant pairs of features within the gliding box B provides a summary of the density of sets of features contained within H_c. Just as the lacunarity, Γ, defined in equation 1.2.3 for a two-shaded or bi-colored heat map increases as the number of the subsets with large clusters of features with the desired shade or color decreases, a lacunarity, Γ_mac, based upon modifying the gliding box algorithm of Allain and Cloitre to count discordant pairs will increase as the number of subsets with large clusters of any color or shade decreases for a given heat map H_c.
Consider the gliding box, B_c, as defined in Figure 2
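The idea of counting neighboring discordant pairs inside the gliding box can be sketched as follows. Several choices here are assumptions made for illustration rather than the author's stated algorithm: two features are treated as neighbors when they are horizontally or vertically adjacent, the box moves one feature at a time and stays inside the map, and Γ_mac is taken as the same moment ratio (second moment over squared first moment) applied to the discordant-pair scores. All function names and both test maps are hypothetical.

```python
import numpy as np

def discordant_pairs(box):
    """Count horizontally and vertically adjacent feature pairs whose shades differ."""
    horizontal = np.count_nonzero(box[:, 1:] != box[:, :-1])
    vertical = np.count_nonzero(box[1:, :] != box[:-1, :])
    return horizontal + vertical

def modified_lacunarity(heat_map, k):
    """Assumed form: moment ratio of discordant-pair scores over all k x k box positions."""
    rows, cols = heat_map.shape
    scores = np.array([discordant_pairs(heat_map[i:i + k, j:j + k])
                       for i in range(rows - k + 1)
                       for j in range(cols - k + 1)], dtype=float)
    m1 = scores.mean()
    if m1 == 0:
        raise ValueError("no discordant pairs: the map is a single shade at this scale")
    return (scores ** 2).mean() / m1 ** 2

# Hypothetical 6x6 maps in three shades (0, 1, 2).
banded = np.repeat(np.array([0, 1, 2]), 2)[:, None] * np.ones((1, 6), dtype=int)
mixed = np.indices((6, 6)).sum(axis=0) % 3  # shades change at every step

print("banded:", modified_lacunarity(banded, 2))
print("mixed: ", modified_lacunarity(mixed, 2))
```

In this sketch the banded map, which contains large homogeneous blocks, scores about 2.5, while the fully mixed map, in which every box holds the same number of discordant pairs, scores 1.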

The choice of box size and its influence on the calculation of Γ_mac
Consider the situation illustrated in Figure 2.2.1. For a given choice of the gliding box's diameter, the number of features contained within the box can differ from one position to the next as the box completes its first row of coverage. Figure 2.2.1 illustrates three choices of box size. When k=2, each movement of the box along a row covers the same number of partition pieces; however, this is not the case for k=3 and k=4. The algorithm suggested in Section 2.1 may still be employed in the circumstances illustrated in Figure 2.2.1 for k=3 and k=4, with minimal effect on the calculation of the modified lacunarity, if the number of movements of the box is large relative to the size of the heat map. When a box spans a region outside of the heat map, the author recommends the convention that the discordances be measured only on the portion of the box that covers the heat map. This convention is illustrated in Figure 2.
Borys [37] derived the relationship (equation 2.2.1) between a reference lacunarity, say Γ_0, for a gliding box with diameter µ_0 (normalized to the heat map such that the heat map's size is considered unity) and the lacunarity, say Γ, determined for a box with diameter µ (normalized to the same heat map), where D is the generalized fractal dimension reported in Ott [38]. For a covering of a set, it is possible to determine a fixed D, and a lacunarity calculated on a normalized diameter of µ can be compared to one based on a normalized diameter of µ_0. Thus, although the user of a lacunarity-based technique is free to choose his or her value of scale, the lacunarity values for different choices of scale can be related via 2.2.1. If two users cannot agree on a common diameter for the box, one can easily transform the corresponding lacunarity values from one scale to another.
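The edge convention recommended above, measuring discordances only on the part of the box that covers the heat map, can be sketched as follows. The step size is an assumption: here the box advances by its own diameter k, which reproduces the behavior described for a map whose side is divisible by 2 but not by 3 or 4. The 10x10 random map, the function names, and the 4-neighbor definition of a discordant pair are all hypothetical choices made for illustration.

```python
import numpy as np

def discordant_pairs(box):
    """Count horizontally and vertically adjacent feature pairs whose shades differ."""
    horizontal = np.count_nonzero(box[:, 1:] != box[:, :-1])
    vertical = np.count_nonzero(box[1:, :] != box[:-1, :])
    return horizontal + vertical

def clipped_box_scores(heat_map, k, step):
    """Return (features covered, discordant pairs) for every box position.

    A box that overhangs the map is clipped to the map, so only the
    covered portion contributes discordant pairs.
    """
    rows, cols = heat_map.shape
    results = []
    for i in range(0, rows, step):
        for j in range(0, cols, step):
            box = heat_map[i:i + k, j:j + k]  # slicing clips at the map edge
            results.append((box.size, discordant_pairs(box)))
    return results

rng = np.random.default_rng(0)
heat_map = rng.integers(0, 3, size=(10, 10))  # hypothetical map in three shades

for k in (2, 3, 4):
    covered = sorted({n for n, _ in clipped_box_scores(heat_map, k, step=k)})
    print(f"k={k}: distinct numbers of features covered per box = {covered}")
```

For k=2 every box covers four features, whereas for k=3 and k=4 the boxes along the right and bottom edges cover fewer features than the interior boxes, which is the situation the convention is meant to handle.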

Illustrations of the Technique in Simulated and Real Data
An example of the calculation of the modified lacunarity in the arena of cluster analysis with simulated data
As mentioned in Section 1.1, heat maps are used frequently to summarize results in the field of bioinformatics, primarily after a cluster analysis is performed. Packages like SAS JMP® (version 9) allow users to examine the results of cluster analyses via a color map within the multivariate analysis module in the analysis platform of the product. Two data sets were simulated with SAS JMP®. The first data set consisted of 32 8-tuples (items with 8 attributes) of simulated uniform (0,1) variates. Perfect agreement between the components of 25% of the cases was artificially induced in the simulation to create large block structures in the SAS JMP® color map (i.e., to artificially create a very dense cluster of a small set of the items). A second data set consisted of 32 items. The first attribute or component of each item was simulated from a standard Gaussian distribution. For the next 5 of the components, half of the items were assigned linear combinations of the first attribute; the remaining half was assigned random Gaussian noise. The sixth and seventh components consisted of values simulated from a Gaussian distribution. The eighth component was simulated from a uniform (0,1) distribution. The two data sets were independently analyzed using the default options (hierarchical and Ward's method) in the multivariate analysis module within the analysis platform of SAS JMP®. The results of the cluster analyses for the two data sets are shown in 1 (b), suggesting that the color map of the first has more "holes" or regions of a single color (consistency) than that of the second.
A gross inspection of the data represented in the three heat maps allows the observer to glean some information.
As expected, the average daily temperature of Lima, Peru is more consistent than that of the other two cities (because of its latitude and topography), as is evident from the large blocks of color present in the corresponding heat map. Moreover, the average daily temperature of Duluth, Minnesota is more varied, as is evident from the relatively smaller blocks of color present in its corresponding heat map. These evaluations are highly subjective. If one needed to organize a large series of heat maps, of which these three are representative, an index of temperature consistency might be of value.
If one were interested in a single index quantifying the consistency of the temperatures for the three cities in the first 15 days of January for 15 years, he or she could use the modified lacunarity. Suppose that a meteorologist were interested in the agreement of temperatures between three days over three years as an estimate of consistency. He or she could use the modified lacunarity as a possible descriptive statistic.

An example of the calculation of the modified lacunarity for a study of longitudinal system-based analysis of transcriptional responses to Type I Interferons
A third illustration of the modified calculation was based on Figure 1.D in [39], a study of the transcriptional response to Type I Interferons. Using a 2x2 gliding box, it is possible to use the figure to estimate Γ(2) for the portions of the heat map corresponding to APP x IFN-β1a (upper left-hand 10x6 portion), APP x IFN-α2b (right-hand side, adjacent 10x6 portion), JS x APP x IFN-β1a (12x6 portion below APP x IFN-β1a) and APP x IFN-α2b (12x6 portion right-hand side, adjacent portion). The values would be 1.69, 1.13, 1.10 and 0.91. These values are consistent with the concept of the modified lacunarity: to find large homogeneous blocks within the heat map with little diversity in color. The figure with the most color change is APP x IFN-α2b, suggesting a greater change in activity (varying from very dark blue to very bright yellow, or from -1 to 1, respectively).

Discussion
The concept of lacunarity introduced by Mandelbrot is used as a summary statistic in many applications [40][41][42][43][44][45]. To date, the lacunarity statistic has been applied only in settings with images of two colors or shades and has not been applied to summarize the content of heat map data displays. The proposed modified lacunarity statistic affords investigators the option of summarizing or indexing large numbers of heat maps based upon the presence of large monochromatic blocks of features. The modified lacunarity proposed in this work is easy to compute, and its interpretation is intuitive in applied settings.
The limitation of this proposed statistic is its small variation relative to the larger variation perceived by the interpreter of a heat map. This feature is evident in Figure 3.2.1. The small variation in the value of the modified lacunarity is most likely due to the smaller correlations in yearly temperatures for a given day relative to the correlations between days within a given year. Where larger correlations exist between rows and columns in a heat map, larger blocks of uniform color can exist.
All of the lacunarity calculations performed in this work were carried out with MS Excel 2003. With the current state of the art in computer software, it seems plausible that one could output the color codes used to produce an image, transfer them to a simple spreadsheet and automate the gliding box process for several box sizes. Thus, due to its relatively simple implementation, ease of calculation and intuitiveness, the modified lacunarity has the potential to be a new tool that can be used in many scientific arenas to aid in exploratory data analysis and subsequent hypothesis generation.