This week I would like to discuss the visual representation of data. This entry is directly and heavily influenced by Edward R. Tufte's 1983 book titled "The Visual Display of Quantitative Information". You can check it out here! Data represented in graphical form is of the utmost importance, particularly when it comes to science and relaying your results to others. We will be discussing what makes a representation good or poor. Just like words, images can convey complex meaning and further our understanding of a topic. Tufte suggests the following criteria be met for a suitable representation:
- Show the data, with clarity and integrity
- Make large datasets understandable
- Encourage the reader to mostly think about the substance (not methodology, production, techniques, etc.)
- Reveal data at multiple levels (from big picture to the minutiae)
These criteria give us a foundation for judging what can make or break a graphical display. Let's dive into a few examples and see what went right or wrong.
In this visualization, the US is broken into counties, and cancer deaths are given as rates per 100,000 people in each county. Generally, the darker the hue, the higher the death rate in that county. The representation does an excellent job of showing curious groupings of counties that appear to be relative hotspots. Our eye is immediately drawn to the Kentucky/West Virginia border. From there, we can glean a general idea that cancer death rates are higher in the South than on the West Coast. This is a good example of showing big-picture trends along with the finer details. It also demonstrates keeping the reader's attention on the data rather than the methodology. A modern aspect of this map, if you click through on the link, is the time variability that can be played with. So not only does it show the rates for a single year, it does so over many years. What this image does not represent accurately is the population distribution. The geographic size of a county may tempt us to read it as a stand-in for population, but that is not the case. Additionally, causes of death can be hard to survey accurately, since coroners and doctors may not always record the correct cause of death, but rather the easiest or most obvious one.
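To make the "per 100,000 people" idea concrete, here is a minimal sketch of how such a rate could be computed. The county names and numbers are invented purely for illustration and do not come from the map; the function name `rate_per_100k` is also just an assumption for this example.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Return deaths per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population


# Hypothetical counties (made-up numbers, for illustration only).
counties = {
    "Small County": {"deaths": 30, "population": 12_000},
    "Large County": {"deaths": 900, "population": 600_000},
}

# Compute the rate for each county so they can be compared fairly.
rates = {
    name: rate_per_100k(c["deaths"], c["population"])
    for name, c in counties.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.0f} deaths per 100,000")
```

Note that the small county has far fewer deaths but the higher rate, which is exactly why a map shaded by rate can look very different from one shaded by raw counts, and why county area tells us nothing about population.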
Example 2: Minard's Map of Napoleon's Retreat
Here, in the bottom panel, we have a display drawn up by Charles Joseph Minard detailing the defeat of Napoleon's army in 1812. Tufte explains, "It may well be the best statistical graphic ever drawn." This map shows the path and size of Napoleon's army from the western edge of Russia to Moscow. On the left side, in red, is the start of the army's movement. The width of the line represents how large the army is at that point in time, starting at 400,000+ soldiers. The offshoots show small groups of soldiers breaking off to prevent a flank attack. The army moved generally east until it arrived in Moscow. It ended up retreating, and its path home is denoted with the black line, ending with 10,000 soldiers. The bottom graph also indicates the temperatures during the retreat. There are six variables shown here: army size, direction, location in 2D space (which counts as two variables), dates, and temperature during the retreat. All of this information could have been presented in a table, but integrating it into a map made it far more meaningful and interesting.
Example 3: Marey's Paris Train Schedule
This graphic, made in 1885, is from E.J. Marey, La Méthode Graphique, page 20. It shows the train schedule from Lyon to Paris, France in the 1880s. The spacing of the horizontal lines across the plot shows the relative physical distance between stops. The vertical lines mark time. The angled lines are the trains themselves: their slope indicates the train's speed, and the short horizontal segments that interrupt them indicate waiting time at a station. The closer to vertical the line is, the faster the train travels. Lines run from Lyon to Paris and vice versa. So, with this display we can figure out how long it might take to get from Lyon to Paris depending on what time we left. This graph clearly packs in a lot of information and can be overwhelming. But the depiction is creative and illustrative of train scheduling.
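The "steeper line means faster train" reading is just distance divided by time. Here is a small sketch of that arithmetic. The segment length and the departure and arrival times are made up for illustration and are not read off Marey's chart; `average_speed_kmh` is a hypothetical helper name.

```python
def average_speed_kmh(distance_km: float, depart_min: float, arrive_min: float) -> float:
    """Slope of a train's line: distance covered divided by elapsed time."""
    elapsed_hours = (arrive_min - depart_min) / 60
    if elapsed_hours <= 0:
        raise ValueError("arrival must be after departure")
    return distance_km / elapsed_hours


# Two hypothetical trains covering the same 120 km stretch.
# Times are minutes after midnight (invented values).
express = average_speed_kmh(120, depart_min=480, arrive_min=570)  # 90 minutes
local = average_speed_kmh(120, depart_min=480, arrive_min=660)    # 180 minutes

print(f"express: {express:.0f} km/h, local: {local:.0f} km/h")
```

On a Marey-style chart both trains would start at the same point, but the express line would climb twice as steeply as the local one. A stop at a station adds elapsed time with no distance, which is why waiting shows up as a flat, horizontal segment.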
Discussion: Data-Ink Ratio
In the theory of data graphics the mantra is "Above all else show the data". With that in mind, we ideally want to print only what is necessary to show the data. Data-ink is the non-redundant core of a graphic: the ink that represents the data itself. This is the ink that you can't erase without losing important information. According to Tufte, the data-ink ratio is
Data-Ink-Ratio = Data-Ink / Total Ink In Graphic
The ideal ratio, of course, would be 1, where only the necessary information is drawn and there is no excess. This isn't the case for most graphs, and there is a balance that needs to be struck. As long as there is a justifiable reason to include the extra ink, it can stay. Simplifying your graphs and plots can be a huge step toward making your data easier to read. Tufte offers tips and tricks, like range-frames and dot-dash-plots, that can reduce the amount of ink needed to create your figure.
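As a rough illustration of the ratio, here is a sketch that tallies a made-up "ink budget" for a chart before and after cleanup. The ink amounts are arbitrary units invented for this example (Tufte does not prescribe a way to measure ink), and the function and dictionary names are assumptions.

```python
def data_ink_ratio(data_ink: float, total_ink: float) -> float:
    """Data-ink ratio = data-ink / total ink in the graphic."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical ink budgets in arbitrary units (invented values).
before = {"data lines": 40, "grid": 35, "frame": 15, "decoration": 10}
after = {"data lines": 40, "grid": 5, "frame": 5}  # lighter grid, no decoration

ratio_before = data_ink_ratio(before["data lines"], sum(before.values()))
ratio_after = data_ink_ratio(after["data lines"], sum(after.values()))

print(f"before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

The data lines never change here; the ratio improves only because non-data ink (heavy grid, frame, decoration) is erased or lightened.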
Take the Marey Schedule with its Data-Ink-Ratio improved. It certainly looks more readable!