I argue that there is a range of human environments in which data are interpreted, and that any presentation of results should be crafted around that context.
In this new information age, there are more bad charts than ever.
The basic principles of good data visualization are well covered in the existing literature, which offers tips for presenting tables, histograms, legends, colors, and so on.
There is also research on how (not) to manipulate audience perception with data. For example, if someone wants to show more positive growth over time, they can change their base year or re-scale their axis (not to mention manipulate the growth model itself!).
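The base-year trick can be seen in a few lines of Python. This is only an illustrative sketch: the revenue figures and the helper name `growth_since` are invented for the example and do not come from any real dataset or library.

```python
def growth_since(series, base_year):
    """Percent change of each year from base_year onward, relative to base_year."""
    base = series[base_year]
    return {
        year: 100.0 * (value - base) / base
        for year, value in series.items()
        if year >= base_year
    }


# Hypothetical revenue figures, made up purely for illustration.
revenue = {2018: 125.0, 2019: 110.0, 2020: 100.0, 2021: 115.0, 2022: 125.0}

# Same data, two different stories about 2022.
honest = growth_since(revenue, 2018)      # measured from the first year on record
flattering = growth_since(revenue, 2020)  # measured from the trough

print(f"Growth to 2022 from 2018: {honest[2022]:.1f}%")
print(f"Growth to 2022 from 2020: {flattering[2022]:.1f}%")
```

Nothing about the underlying numbers changes between the two calls; only the reference point does, which is exactly why the choice of base year deserves scrutiny.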
An under-studied area in visual perception is the environmental context for interpreting data. For simplicity, let’s divide data visualization into two environments: the group environment and the individual environment. The typical individual environment is someone isolated behind a computer in an office. A typical group environment is the presentation of data on a large screen in a conference setting.
One potential problem in the production of data visualizations is that most presentations are created in the individual environment but results are shared in a group environment.
Is the median statistical programmer/designer producing sub-optimal visualizations because they are designing for their current, isolated environment? Are they taking the presentation environment into consideration? Should they? If so, how?
To pursue this line of questioning, one must first ask whether the human environment matters in interpreting data. If so, how do we define these environments and their effects?
I’ve written a paper proposing a two-factor within-subjects design, where subjects are exposed to multiple visualizations under both environments. Individuals are randomly allocated to two different sequences in order to control for period effects from the first treatment. In each sequence, they are shown similar (but different) visualizations and asked to answer specific questions. The response variable is the latency or delay in answering the question.
The proposal incorporates both theory and an analysis of synthetic data.
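To make the design concrete, here is a minimal synthetic-data sketch of a two-sequence crossover in Python. It is not the analysis from the paper: the effect sizes, the noise levels, the single latency per period, and the function names `simulate_crossover` and `estimate_environment_effect` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_crossover(n_subjects=40, env_effect=0.5, period_effect=0.3,
                       noise_sd=0.2, seed=0):
    """Simulate answer latencies (seconds) for a two-sequence crossover.

    All parameter values are assumed for illustration, not estimated from data.
    """
    rng = random.Random(seed)
    subjects = list(range(n_subjects))
    rng.shuffle(subjects)  # random allocation of subjects to sequences
    half = n_subjects // 2
    rows = []
    for position, subject in enumerate(subjects):
        if position < half:
            order = ("individual", "group")
        else:
            order = ("group", "individual")
        baseline = rng.gauss(5.0, 1.0)  # subject-specific mean latency
        for period, environment in enumerate(order):
            latency = baseline + rng.gauss(0.0, noise_sd)
            if environment == "group":
                latency += env_effect
            if period == 1:
                latency += period_effect  # e.g. fatigue in the second period
            rows.append({"subject": subject, "period": period,
                         "environment": environment, "latency": latency})
    return rows


def estimate_environment_effect(rows):
    """Average the within-subject (group - individual) differences by sequence.

    The period effect enters the two sequences with opposite signs, so the
    mean of the two sequence means cancels it.
    """
    by_subject = {}
    for row in rows:
        by_subject.setdefault(row["subject"], {})[row["environment"]] = row
    diffs = {"individual_first": [], "group_first": []}
    for record in by_subject.values():
        diff = record["group"]["latency"] - record["individual"]["latency"]
        if record["individual"]["period"] == 0:
            diffs["individual_first"].append(diff)
        else:
            diffs["group_first"].append(diff)
    return (statistics.mean(diffs["individual_first"])
            + statistics.mean(diffs["group_first"])) / 2


rows = simulate_crossover()
print(f"Estimated environment effect: {estimate_environment_effect(rows):.2f} s")
```

Because each subject serves as their own control, the subject-specific baseline drops out of every difference, and averaging across the two sequences removes the assumed period effect. A real analysis would more likely fit a mixed model and test for carryover, which this sketch ignores.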
This line of research is important for improving our understanding of ourselves. It is a work of statistical psychology. If we do interpret data differently depending on our environment, we need to optimize our visualizations for correct interpretation and understanding of the data. The statistical community should be leading the conversation on this.
On the production side, I would go so far as to propose a convolutional neural network that reads your existing chart and rates it for interpretability under different human environments using data from these experiments.
A deep learning model could rate elements of the chart and, if you upload your data, could code up those modifications to the existing chart, thus optimizing the chart for the environment.