Data Visualization and Correct Syntax

From the earliest stages of our educational experience, we are taught to communicate effectively with conventions of grammar and syntax. Years after those school lessons, we may not cite any specific convention, but we have internalized a fluent ability for effective communication.

In a professional job interview, for example, an inability to speak with correct grammar and syntax is often a 'show-stopper' for further employment consideration. The inability to use the established conventions of syntax in language raises a red flag. These conventions of communication have come to be known as literacy skills and are a key component of success in any professional position.

The Merriam-Webster Dictionary defines syntax as "the part of grammar dealing with the way in which linguistic elements (as words) are put together to form constituents (as phrases or clauses)." This is a major component of literacy skills. By defining syntax to deal with "the formal properties of languages or calculi," the dictionary extends its meaning to also cover the conventions of communication for combining numeric elements known as numeracy skills.

My last BI Review column on data visualization highlighted an example from a medical journal that telegraphed a numeracy (and graphicacy) 'show-stopper' as we focused on the gratuitous use of 3-D, an option in many graphics software packages. As that column showed, there is no redeeming value in using this 3-D option: it only confuses you and your audience, and invariably leads to misunderstanding the reality in the data.

To help focus on that issue, the last column had the x-axis labeling slightly simplified from the medical journal but Figure 1, below, looks just like the original. This image exhibits additional problems in graphical communication, which appear quite often in data visualizations used in business. After you read this column, these issues will hopefully not appear in your organization, for those who create the visualizations - and those who use them in decision-making - will instantly spot them.

Figure 1

Numeracy in Action: Numerical Nomenclature

With numeracy skills, we have conventions when we communicate about a range of numbers. When we write "[0]" we mean only zero. When we write "(0, +1]" we mean the range beginning immediately above (but not including) zero, up to (and including) one. A parenthesis indicates that the endpoint next to it is not included in the range, and a square bracket indicates that it is included.
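The bracket convention can be stated as a small membership test. The following is a minimal sketch in Python; the function name `in_interval` and its parameters are illustrative assumptions, not something taken from the journal or from any charting package.

```python
def in_interval(x, low, high, low_closed, high_closed):
    """Return True if x lies in the interval, honoring open and closed endpoints.

    A closed endpoint (square bracket) includes the boundary value;
    an open endpoint (parenthesis) excludes it.
    """
    above = x >= low if low_closed else x > low
    below = x <= high if high_closed else x < high
    return above and below


# (0, +1]: zero is excluded, one is included
print(in_interval(0, 0, 1, low_closed=False, high_closed=True))  # False
print(in_interval(1, 0, 1, low_closed=False, high_closed=True))  # True

# [0]: only zero, written here as an interval closed at zero on both sides
print(in_interval(0, 0, 0, low_closed=True, high_closed=True))   # True
```

Making the open or closed status of each endpoint an explicit argument forces whoever labels an axis to decide, for every boundary value, which range it belongs to.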

The image published in the medical journal violates these conventions of numeracy. To those sensitive to this nuance of numerical syntax, the labeling of this axis is jarring and does not make any sense. We do not know if this axis reflects a double counting of all patients with a 'residual' value of exactly -5, -1, 0, +1, or +5. Another option is that this is a typographical error. It is possible that half of the square brackets should have been parentheses, yet we do not know which half! How much credibility would you give to a written report that had half its verbs conjugated incorrectly?

Where is Zero: Found Twice or Sleight of Hand?

The text of the medical journal article suggests that it is important to note the large number of patients to the right side of 'zero.' Yet, in the image we find zero twice, in two buckets at the same time. This seems to support the notion of a typographical error, where the square bracket immediately after a zero [-0.1, 0] or before a zero [0, +0.1] should have been a parenthesis.

While a typographical error seems like a reasonable thesis, if true, then one of these supposedly equal-sized buckets would be larger than the other. The larger bucket would have a substantial number of patients with a zero residual stealthily added to it. The same care we use to assure veracity in our words must be used to assure there is no sleight of hand in our numbers and images.
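Both readings of the axis labels can be checked with a few lines of Python. This is only a sketch: the residual values are invented for illustration, and the "corrected" labeling is just one of the two possible corrections (zero could equally have been assigned to the left bucket).

```python
def matching_buckets(x, buckets):
    """Return the labels of every bucket that contains x."""
    hits = []
    for label, low, high, low_closed, high_closed in buckets:
        above = x >= low if low_closed else x > low
        below = x <= high if high_closed else x < high
        if above and below:
            hits.append(label)
    return hits


def counts(values, buckets):
    """Tally how many values fall in each bucket (a value may hit more than one)."""
    tally = {label: 0 for label, *_ in buckets}
    for v in values:
        for label in matching_buckets(v, buckets):
            tally[label] += 1
    return tally


# Labels as printed: both middle buckets are closed at zero.
as_printed = [("[-0.1, 0]", -0.1, 0.0, True, True),
              ("[0, +0.1]", 0.0, 0.1, True, True)]
# One possible correction: zero belongs only to the right-hand bucket.
corrected = [("[-0.1, 0)", -0.1, 0.0, True, False),
             ("[0, +0.1]", 0.0, 0.1, True, True)]

residuals = [-0.05, 0.0, 0.0, 0.0, 0.05]  # hypothetical data, symmetric about zero

print(counts(residuals, as_printed))  # {'[-0.1, 0]': 4, '[0, +0.1]': 4}
print(counts(residuals, corrected))   # {'[-0.1, 0)': 1, '[0, +0.1]': 4}
```

With the labels as printed, five values produce eight counts because each zero is counted twice. With the correction, the totals agree, but a data set that is symmetric about zero now appears heavily weighted to the right.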

Bucketization: Aggregating Data

What was done with this research data is similar to what is often done with business data. This research was based upon a study of 11,088 patients. (This is the sum of the number of patients in the table below the chart.) The data being plotted on the x-axis is called a 'residual' and was available to at least two, and maybe three or four, significant figures. As is often done with business data, these 11,088 data points were aggregated into 'buckets.' It is possible that 20 to 80 buckets could have been selected, but in this case only six buckets were used - as shown by the six labeled categories and the six bars.

A similar situation in business might be plotting the distribution of the size of each of 11,088 orders a company had last quarter. If these orders were in the $10,000 to $90,000 range, we would have order sizes with five significant figures. We might group those orders into buckets that were $2K wide, for example ($10K, $12K], ($12K, $14K], ... ($88K, $90K]. In this case we would thus have 40 equal-sized buckets in our distribution. The first would begin just above, but not including, $10K and go up to, and including, $12K. With 11,088 orders, we would then have a nice distribution of orders in our 40 buckets, each spanning an equal-sized $2K interval.
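The order-size example can be sketched in Python as follows. The sketch assumes order amounts are whole dollars, and the names `bucket_index` and `bucket_label` are hypothetical helpers introduced only for illustration.

```python
LOW, HIGH, WIDTH = 10_000, 90_000, 2_000  # dollars, taken from the example above
NUM_BUCKETS = (HIGH - LOW) // WIDTH       # 40 equal-sized buckets


def bucket_index(amount):
    """Index of the (low, high] bucket for a whole-dollar amount.

    Returns None when the amount falls outside ($10K, $90K].
    """
    if not (LOW < amount <= HIGH):
        return None
    # Subtracting 1 places an amount that sits exactly on an upper
    # boundary in the bucket that ends there, not the one that starts there.
    return (amount - LOW - 1) // WIDTH


def bucket_label(index):
    """Human-readable label such as '($10K, $12K]'."""
    lo = LOW + index * WIDTH
    return f"(${lo // 1000}K, ${(lo + WIDTH) // 1000}K]"


print(NUM_BUCKETS)                         # 40
print(bucket_label(bucket_index(12_000)))  # ($10K, $12K]
print(bucket_label(bucket_index(12_001)))  # ($12K, $14K]
print(bucket_index(10_000))                # None
```

Every order lands in exactly one bucket, and every bucket spans the same $2K interval, so the heights of the bars can be compared directly.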

Interval vs. Ordinal Buckets

Let's look carefully at the buckets selected for the medical journal. In this case six buckets were selected. The far left and far right buckets cover an indeterminate range, as we are not given any information about how far they extend. The next buckets toward the middle on each side cover the range of four units, and the two buckets in the middle are only one unit wide. This is known as an ordinal scale. There is an ordering to the size of the buckets and they are certainly not equal intervals. On occasion, when done for a specific purpose and clearly identified to all who view the graphic image, this technique may have value. Invariably, though, ordinal buckets are done naïvely or they may be seen as a tip-off to malicious mischief being done with data. If the special purpose of ordinal buckets is not highlighted, nothing but confusion is added and the image cannot fairly communicate the reality in the data.

Six Buckets vs. Sixty+ Buckets

There is no need to take a rich data set of 11,088 values and aggregate them down to only six buckets. Figure 2 shows 11,088 values in 69 buckets, all of an equivalent size, and the graphic display fairly presents a distribution of the data. This is a rather balanced distribution, centered right on zero, and completely symmetrical from -0.5 to +0.5.

Figure 2

In fact, it is possible that Figure 2 may actually represent the true 'distribution' in this research data. Amazingly, the data in Figure 2 could be re-sorted into the six ordinal buckets used in Figure 1, and this data would then look identical to Figure 1!
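The effect of such a re-sorting can be illustrated with a small Python sketch. It does not reproduce the journal's data: the 69 symmetric counts and the cut points for the six groups are hypothetical, chosen only to show how a perfectly symmetric distribution can appear shifted once it is collapsed into a few unequal buckets and the center is assigned to one side.

```python
# Hypothetical symmetric counts over 69 equal-sized buckets, indexed -34 .. +34.
fine = {i: 35 - abs(i) for i in range(-34, 35)}


def coarse_group(i):
    """Map a fine bucket index to one of six unequal groups.

    The cut points are assumptions for illustration. The center bucket
    (index 0) is assigned to the first group on the right of zero.
    """
    if i < -20:
        return 0
    if i < -4:
        return 1
    if i < 0:
        return 2
    if i <= 4:
        return 3  # includes index 0
    if i <= 20:
        return 4
    return 5


coarse = [0] * 6
for i, n in fine.items():
    coarse[coarse_group(i)] += n

print(coarse)                              # [105, 360, 130, 165, 360, 105]
print(sum(coarse[:3]), sum(coarse[3:]))    # 595 630
```

The fine-grained counts are identical on both sides of the center, yet the six-bucket summary shows more observations on the right than on the left, purely because of where the center bucket was placed.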

The Implications of Poor Syntax in the Boardroom

What this means is that many people may have been looking at the Figure 1 image in the journal, or on a policymaking board deciding on best practices for medical treatment options. They would be making a very important decision based on this research, which addresses atrial fibrillation, a disorder found in about 2.2 million Americans. They would be looking at the image and getting the message presented in the surrounding text: that this distribution seems to be shifted to the right, with many more patients to the right side of the 'zero' point.

Yet in Figure 1 we have no idea where the zero point may actually be located. There are far too few buckets to adequately represent any distribution, and the ordinal buckets that were used present a completely unfair representation of a supposed distribution. Here, the violation of established conventions of syntax in numeracy and graphicacy should have raised a red flag. Someone should have called Figure 1 a 'show-stopper.'

Would such errors be spotted in a boardroom in your organization? Is your management team equally adept at using correct syntax with linguistic, numerical, and graphical communication? Do your data visualizations have correct syntax?


Howard A. Spielman, M.B.A., Ph.D., President of Management Semiotics International Inc., can be reached at HASpielman@ManagementSemiotics.com.
