Data Visualization: A Scientific Treatment

The above diagram was compiled by Florence Nightingale, who was – according to The Font – “a celebrated English social reformer and statistician, and the founder of modern nursing”.
It is gratifying to see her less high-profile role as a number-cruncher
acknowledged up-front and central; particularly as she died in 1910,
eight years before women in the UK were first allowed to vote and
eighteen before universal suffrage. This diagram is one of two which are
generally cited in any article on Data Visualisation. The other is Charles Minard’s exhibit detailing the advance on, and retreat from, Moscow of Napoleon Bonaparte’s Grande Armée
in 1812 (Data Visualisation had a military genesis in common with –
amongst many other things – the internet). I’ll leave the reader to look
at this second famous diagram if they want to; it’s just a click away.

While there are more elements of numeric information in Minard’s work (what we would now call measures), there is a differentiating point to be made about Nightingale’s diagram. This is that it was specifically produced to aid members of the British parliament in their understanding of conditions during the Crimean War (1853-56); particularly given that such non-specialists had struggled to understand traditional (and technical) statistical reports. Again, rather remarkably, we have here a scenario where the great and the good were listening to the opinions of someone who was barred from voting on the basis of lacking a Y chromosome. Perhaps more pertinently to this blog, this scenario relates to one of the objectives of modern-day Data Visualisation in business; namely explaining complex issues, which don’t leap off a page of figures, to busy decision makers, some of whom may not be experts in the specific subject area (another is of course allowing the expert to discern less than obvious patterns in large or complex sets of data). Fortunately most business decision makers don’t have to grapple with the progression in number of “deaths from Preventible or Mitigable Zymotic diseases” versus “deaths from wounds” over time, but the point remains.

Data Visualisation in one branch of Science

Coming much more up to date, I wanted to consider a modern example of
Data Visualisation. As with Nightingale’s work, this is not
business-focused, but contains some elements which should be pertinent
to the professional considering the creation of diagrams in a business
context. The specific area I will now consider is Structural Biology.
For the incognoscenti (no advert for IBM intended!), this area of
science is focussed on determining the three-dimensional shape of
biologically relevant macro-molecules, most frequently proteins or
protein complexes. The history of Structural Biology is intertwined with
the development of X-ray crystallography by Max von Laue and father and son team William Henry and William Lawrence Bragg; its subsequent application to organic molecules by a host of pioneers including Dorothy Crowfoot Hodgkin, John Kendrew and Max Perutz; and – of greatest resonance to the general population – Francis Crick, Rosalind Franklin, James Watson and Maurice Wilkins’s joint determination of the structure of DNA in 1953.

X-ray diffraction image of the double helix
structure of the DNA molecule, taken 1952 by Raymond Gosling, commonly
referred to as “Photo 51”, during work by Rosalind Franklin on the
structure of DNA

While the masses of data gathered in modern X-ray crystallography
need computer software to extrapolate them to physical structures,
things were more accessible in 1953. Indeed, it could be argued that
Gosling and Franklin’s famous image, its characteristic “X” suggestive
of two helices and thus driving Crick and Watson’s model building, is
another notable example of Data Visualisation; at least in the sense of a
picture (rather than numbers) suggesting some underlying truth. In this
case, the production of Photo 51 led directly to the creation of the
even more iconic image below (which was drawn by Francis Crick’s wife
Odile and appeared in his and Watson’s seminal Nature paper^[1]):

It is probably fair to say that the visualisation of data which is displayed above has had something of an impact on humankind in the more than fifty years since it was first drawn.

Modern Structural Biology

Today, X-ray crystallography is one of many tools available to the
structural biologist with other approaches including Nuclear Magnetic
Resonance Spectroscopy, Electron Microscopy and a range of biophysical
techniques which I will not detain the reader by listing. The cutting
edge is probably represented by the X-ray Free Electron Laser, a device
originally created by repurposing the linear accelerators of the
previous generation’s particle physicists. In general Structural Biology
has historically sat at an intersection of Physics and Biology.

However, before trips to synchrotrons can be planned, the Structural Biologist often faces the prospect of stabilising their protein of interest, ensuring that they can generate sufficient quantities of it, successfully isolating the protein and finally generating crystals of appropriate quality. This process often consumes years, in some cases decades. As with most forms of human endeavour, there are few short-cuts and the outcome is at least loosely correlated to the amount of time and effort applied (though sadly with no guarantee that hard work will always be rewarded).

From the general to the specific

At this point I should declare a personal interest: the example of
Data Visualisation which I am going to consider is taken from a paper
recently accepted by the Journal of Molecular Biology (JMB) and of which my wife is the first author^[2]. Before looking at this exhibit, it’s worth a brief detour to provide some context.

In recent decades, the exponential growth in the breadth and depth of
scientific knowledge (plus of course the velocity with which this can
be disseminated), coupled with the increase in the range and complexity
of techniques and equipment employed, has led to the emergence of
specialists. In turn this means that, in a manner analogous to the early
production lines, science has become a very collaborative activity;
expert in stage one hands over the fruits of their labour to expert in
stage two and so on. For this reason the typical scientific paper (and
certainly those in Structural Biology) will have several authors, often
spread across multiple laboratory groups and frequently in different
countries. By way of example the previous paper my wife worked on had 16
authors (including a Nobel Laureate^[3]). In this context, the fact that the paper I will now reference was authored by just my wife and her group leader is noteworthy.

The reader may at this point be relieved to learn that I am not going
to endeavour to explain the subject matter of my wife’s paper, nor the
general area of biology to which it pertains (the interested are
recommended to Google “membrane proteins” or “G Protein Coupled Receptors” as a starting point). Instead let’s take a look at one of the exhibits.

The above diagram (in common with Nightingale’s much earlier one)
attempts to show a connection between sets of data, rather than just the
data itself. I’ll elide the scientific specifics here and focus on more
general issues.

First the grey upper section with the darker blots on it – which is labelled (a) – is an image of a biological assay called a Western Blot (for the interested, details can be viewed here);
each vertical column (labelled at the top of the diagram) represents a
sub-experiment on protein drawn from a specific sample of cells. The
vertical position of a blot indicates the size of the molecules found
within it (in kilodaltons);
the intensity of a given blot indicates how much of the substance is
present. Aside from the headings and labels, the upper part of the
figure is a photographic image and so essentially analogue data^[4]. So, in summary, this upper section represents the findings from one set of experiments.

At the bottom – and labelled (b) – appears an
artefact familiar to anyone in business, a bar-graph. This presents
results from a parallel experiment on samples of protein from the same
cells (for the interested, this set of data relates to the degree to which
proteins in the samples bind to a specific radiolabelled ligand).
The second set of data is taken from what I might refer to as a
“counting machine” and is thus essentially digital. To be 100% clear,
the bar chart is not a representation of the data in the upper part of
the diagram, it pertains to results from a second experiment on the same
samples. As indicated by the labelling, for a given sample, the column
in the bar chart (b) is aligned with the column in the Western Blot above (a), connecting the two different sets of results.

Taken together the upper and lower sections^[5] establish a relationship between the two sets of data. Again I’ll skip over the specifics, but the general point is that while the Western Blot (a) and the binding assay (b) tell us the same story, the Western Blot is a much more straightforward and speedy procedure. The relationship that the paper establishes means that just the Western Blot can be used to perform a simple new assay which will save significant time and effort for people engaged in the determination of the structures of membrane proteins; a valuable new insight. Clearly the relationships that have been inferred could equally have been presented in a tabular form instead and be just as relevant. It is however testament to the more atavistic side of humans that – in common with many relationships between data – a picture says it more surely and (to mix a metaphor) more viscerally. This is the essence of Data Visualisation.

What learnings can Scientific Data Visualisation provide to Business?

Using the JMB exhibit above, I wanted to now make some more general observations and consider a few questions which arise out of comparing scientific and business approaches to Data Visualisation. I think that many of these points are pertinent to analysis in general.

Normalisation

Broadly, normalisation^[6]
consists of defining results in relation to some established yardstick
(or set of yardsticks); displaying relative, as opposed to absolute,
numbers. In the JMB exhibit above, the amount of protein solubilised in
various detergents is shown with reference to the un-solubilised amount
found in native membranes; these reference figures appear as 100%
columns to the right and left extremes of the diagram.
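
For readers who like to see such things spelled out, here is a minimal Python sketch of normalising against a reference. The function name, the labels and the readings are all invented for illustration; none of them is taken from the JMB paper, and the arithmetic is simply "value divided by reference, times 100".

```python
def normalise_to_reference(values, reference):
    """Express each value as a percentage of a chosen reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return {name: 100.0 * value / reference for name, value in values.items()}


# Hypothetical readings in arbitrary units (assumed for this sketch only).
raw = {"native membranes": 80.0, "detergent A": 60.0, "detergent B": 20.0}

# The reference sample becomes the 100% column; the others are shown relative to it.
percent = normalise_to_reference(raw, raw["native membranes"])

for name, value in percent.items():
    print(f"{name}: {value:.0f}%")
```

The choice of yardstick matters more than the arithmetic: pick a different reference and every relative figure changes with it.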

The most common usage of normalisation in business is growth
percentages. Here the fact that London business has grown by 5% can be
compared to Copenhagen having grown by 10% despite total London business
being 20 times the volume of Copenhagen’s. A related business example,
depending on implementation details, could be comparing foreign currency
amounts at a fixed exchange rate to remove the impact of currency
fluctuation.
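
Both business examples can be sketched in a few lines of Python. All of the figures below are hypothetical, as are the function names and the exchange rates; the point is only to show that growth percentages make differently sized businesses comparable, and that converting at one fixed rate strips currency movement out of a comparison.

```python
def growth_percent(previous, current):
    """Growth of `current` over `previous`, expressed as a percentage."""
    if previous == 0:
        raise ValueError("previous must be non-zero")
    return 100.0 * (current - previous) / previous


# Hypothetical volumes: London is 20 times the size of Copenhagen.
london_growth = growth_percent(2000.0, 2100.0)
copenhagen_growth = growth_percent(100.0, 110.0)

# Hypothetical Danish sales in DKK for two periods, and assumed GBP-per-DKK rates.
sales_dkk = (1000.0, 1100.0)
actual_rates = (0.125, 0.15)   # rate in force in each period (assumed)
fixed_rate = 0.125             # single rate applied to both periods (assumed)

# At actual rates the growth figure mixes trading performance with currency movement.
growth_actual = growth_percent(sales_dkk[0] * actual_rates[0],
                               sales_dkk[1] * actual_rates[1])

# At a fixed rate only the underlying change in sales remains.
growth_fixed = growth_percent(sales_dkk[0] * fixed_rate,
                              sales_dkk[1] * fixed_rate)

print(f"London {london_growth:.1f}%, Copenhagen {copenhagen_growth:.1f}%")
print(f"Actual rates {growth_actual:.1f}%, fixed rate {growth_fixed:.1f}%")
```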

Normalised figures are very typical in science, but, aside from the growth example mentioned above, considerably less prevalent in business. In both avenues of human endeavour, the approach should be used with caution; something that increases 200% from a very small starting point may not be relevant, be that the result of an experiment or weekly sales figures. Bearing this in mind, normalisation is often essential when looking to present data of different orders on the same graph^[7]; the alternative often being that smaller data is swamped by larger, not always what is desirable.

Controls

I’ll use an anecdote to illustrate this area from a business
perspective. Imagine an organisation which (as you would expect) tracks
the volume of sales of a product or service it provides via a number of
outlets. Imagine further that it launches some sort of promotion,
perhaps valid only for a week, and notices an uptick in these sales. It
is extremely tempting to state that the promotion has resulted in
increased sales^[8].

However this cannot always be stated with certainty. Sales may have
increased for some totally unrelated reason such as (depending on what
is being sold) good or bad weather, a competitor increasing prices or
closing one or more of their comparable outlets and so on. Equally
perniciously, the promotion may have simply moved sales in time –
people may have been going to buy the organisation’s product or service
in the weeks following a promotion, but have brought the expenditure
forward to take advantage of it. If this is indeed the case, an uptick
in sales may well be due to the impact of a promotion, but will be
offset by a subsequent decrease.

In science, it is this type of problem that the concept of control
tests is designed to combat. As well as testing a result in the presence of substance or condition X, a well-designed scientific experiment will also be carried out in the absence of
substance or condition X, the latter being the control. In the JMB
exhibit above, the controls appear in the columns with white labels.

There are ways to make the business “experiment” I refer to above
more scientific of course. In retail business, the current focus on
loyalty cards can help, assuming that these can be associated with the
relevant transactions. If the business is on-line then historical
records of purchasing behaviour can be similarly referenced. In the
above example, the organisation could decide to offer the promotion at
only a subset of its outlets, allowing a comparison to those where
no promotion applied. This approach may improve rigour somewhat, but of
course it does not cater for purchases transferred from a non-promotion
outlet to a promotion one (unless a whole raft of assumptions are made).
There are entire industries devoted to helping businesses deal with
these rather messy scenarios, but it is probably fair to say that it is
normally easier to devise and carry out control tests in science.

The general take away here is that a graph which shows some change in a business output (say sales or profit) correlated to some change in a business input (e.g. a promotion, a new product launch, or a price cut) would carry a lot more weight if it also provided some measure of what would have happened without the change in input (not that this is always easy to measure).
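
As a rough sketch of what such a measure might look like, the Python below compares the change in sales at outlets which ran the promotion with the change at outlets which did not; this is a simple difference-in-differences style comparison. The function name and every figure are hypothetical, and the sketch assumes that the two groups of outlets would otherwise have moved in step. It makes no allowance for purchases brought forward in time or transferred between outlets.

```python
from statistics import mean


def uplift_vs_control(promo_before, promo_during, control_before, control_during):
    """Change in mean sales at promotion outlets minus the change at control outlets."""
    promo_change = mean(promo_during) - mean(promo_before)
    control_change = mean(control_during) - mean(control_before)
    return promo_change - control_change


# Hypothetical weekly sales per outlet (invented for illustration).
promo_before = [100, 120, 110]     # promotion outlets, week before
promo_during = [130, 150, 140]     # promotion outlets, promotion week
control_before = [90, 110, 100]    # control outlets, week before
control_during = [100, 120, 110]   # control outlets, promotion week

raw_uplift = mean(promo_during) - mean(promo_before)
estimated_uplift = uplift_vs_control(
    promo_before, promo_during, control_before, control_during
)

print(f"Raw uplift: {raw_uplift}, uplift net of control: {estimated_uplift}")
```

In this made-up case a third of the apparent uplift also shows up at outlets with no promotion, which is exactly the kind of thing a graph without a control would hide.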

Rigour and Scrutiny

I mention in the footnotes that the JMB paper in question includes
versions of the exhibit presented above for four other membrane
proteins, this being in order to firmly establish a connection. Looking
at just the figure I have included here, each element of the data
presented in the lower bar-graph area is based on duplicated or
triplicated tests, with average results (and error bars – see the next
section) being shown. When you consider that upwards of three months’
preparatory work could have gone into any of these elements and that a
mistake at any stage during this time would have rendered the work
useless, some impression of the level of rigour involved emerges. The
result of this assiduous work is that the authors can be confident that
the exhibits they have developed are accurate and will stand up to
external scrutiny. Of course such external scrutiny is a key part of the
scientific process and the manuscript of the paper was reviewed
extensively by independent experts before being accepted for
publication.
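
The step from duplicated or triplicated tests to a bar with an error bar can be sketched as follows. This is an illustration under stated assumptions rather than the method used in the paper: the readings are invented, and the error bar is assumed here to be the standard error of the mean, which is only one of several common conventions.

```python
from math import sqrt
from statistics import mean, stdev


def summarise_replicates(readings):
    """Return (mean, standard error of the mean) for a set of replicate readings."""
    n = len(readings)
    if n < 2:
        raise ValueError("at least two replicates are needed for an error bar")
    centre = mean(readings)
    # Sample standard deviation divided by sqrt(n): the assumed error-bar convention.
    standard_error = stdev(readings) / sqrt(n)
    return centre, standard_error


# Hypothetical triplicate readings for one sample, as a percentage of the reference.
triplicate = [98.0, 100.0, 102.0]
bar_height, error_bar = summarise_replicates(triplicate)

print(f"bar height {bar_height:.1f}, error bar +/- {error_bar:.2f}")
```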

In the business world, such external scrutiny tends to apply most
frequently to publicly published figures (such as audited Financial
Accounts); of course external financial analysts also will look to dig
into figures. There may be some internal scrutiny around both the
additional numbers used to run the business and the graphical
representations of these (and indeed some companies take this area very
seriously), but not every internal KPI is vetted the way that the report
and accounts are. Particularly in the area of Data Visualisation, there
is a tension here. Graphical exhibits can have a lot of impact if they
relate to the current situation or present trends; contrariwise if they
are substantially out-of-date, people may question their relevance.
There is sometimes the expectation that a dashboard is just like its
aeronautical counterpart, showing real-time information about what is
going on now^[9].
However a lot of the value of Data Visualisation is not about the here
and now so much as trends and explanations of the factors behind the
here and now. A well-thought-out graph can tell a very powerful story,
more powerful for most people than a table of figures. However a
striking graph based on poor quality data, data which has been combined
in the wrong way, or even – as sometimes happens – the wrong datasets
entirely, can tell a very misleading story and lead to the wrong
decisions being taken.

I am not for a moment suggesting here that every exhibit produced using Data Visualisation tools must be subject to month…[TRUNCATED]
