Lecture 14 – Principles of Data Visualization

Arvind R. Subramaniam

Assistant Member

Basic Sciences Division and Computational Biology Program

Fred Hutchinson Cancer Research Center

Learning Objectives

  • Know the general dos and don'ts of data visualization
  • Know about different types of data visualization
  • Know effective strategies for visualization

Useful reference

(Source of many figures in this lecture)

Fundamentals of Data Visualization by Claus O. Wilke

Goals of Visualization

  • Show experimental design and results
  • Show relationships among variables
  • Show the range and interval of a variable

Same data can be visualized very differently

temp-normals-vs-time-1.png

Wilke 2018

Same data can be visualized very differently

four-locations-temps-by-month-1.png

Wilke 2018

Same data can be visualized very differently

temperature-normals-polar-1.png

Wilke 2018

Elements of a visualization

  • Aesthetics
  • Scales
  • Labeling
  • Exporting

Aesthetics

common-aesthetics-1.png

Wilke 2018

How do you pick the type of aesthetics?

  • How many variables and data points do you want to show?
  • Is your data continuous or discrete?
  • Is there a natural order of discrete variables?

Scales

basic-scales-example-1.png

Wilke 2018

Log axes suitable for P-values

dong_2019_crispr_screen.png

Dong 2019
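
A minimal sketch of why a log axis suits P-values: on a linear axis every small P-value collapses near 0, while -log10(p) spreads them out by order of magnitude. The P-values below are hypothetical and are not taken from Dong 2019.

```python
import math

# Hypothetical P-values spanning several orders of magnitude.
p_values = [0.5, 0.05, 1e-3, 1e-6]

def neg_log10(p):
    """Return -log10(p), the quantity commonly plotted for P-values."""
    return -math.log10(p)

for p in p_values:
    # Smaller P-values map to larger, well-separated positions on the axis.
    print(f"p = {p:g}  ->  -log10(p) = {neg_log10(p):.2f}")
```

Each tenfold drop in the P-value moves the point one unit along the axis, so strong and weak hits remain distinguishable.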

Log axes suitable for fold-changes

10xaag_wt_log2.png

Park 2019
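
A short sketch of why fold-changes are usually shown on a log axis: a twofold increase and a twofold decrease sit at equal distances from 0 on a log2 scale, whereas on a linear scale they sit at 2 and 0.5. The measurements below are hypothetical and are not taken from Park 2019.

```python
import math

def log2_fold_change(treated, control):
    """Return log2(treated / control)."""
    return math.log2(treated / control)

# Hypothetical measurements in arbitrary units.
up = log2_fold_change(200.0, 100.0)    # twofold increase
down = log2_fold_change(50.0, 100.0)   # twofold decrease

print(up, down)
```

The two values have the same magnitude and opposite sign, so increases and decreases are drawn symmetrically.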

Log-log plots are common in biology

dong_2019_volcano_plot.png

Dong 2019

Labels

If you take away only one single lesson from this book, make it this one: Pay attention to your axis labels, axis tick labels, and other assorted plot annotations. Chances are they are too small. In my experience, nearly all plot libraries and graphing softwares have poor defaults. If you use the default values, you’re almost certainly making a poor choice.

Wilke 2018

Unreadable labels

Aus-athletes-small-1.png

Wilke 2018

Small labels

Aus-athletes-ugly-1.png

Wilke 2018

Appropriately-sized labels

Aus-athletes-good-1.png

Wilke 2018

Too-big labels

Aus-athletes-big-ugly-1.png

Wilke 2018

Seemingly big but OK labels

Aus-athletes-big-good-1.png

Wilke 2018

Exporting

  • Finalize the figure within R as much as possible.
  • Use vector graphics for saving: PDF or SVG.
  • Inkscape is a useful open-source vector graphics program for editing figures.
  • Make sure that text can be edited when you open the image.
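
One way to check the last point for an SVG file is to see whether labels are stored as text elements rather than as outlined shapes. The sketch below uses only the Python standard library; the function names `make_svg_with_label` and `editable_labels` are hypothetical helpers written for this illustration, and the tiny SVG stands in for a real exported figure.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def make_svg_with_label(label, width=200, height=100):
    """Build a tiny SVG with one axis line and one label stored as real text."""
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(width), "height": str(height)})
    ET.SubElement(svg, f"{{{SVG_NS}}}line",
                  {"x1": "20", "y1": "80", "x2": "180", "y2": "80",
                   "stroke": "black"})
    text = ET.SubElement(svg, f"{{{SVG_NS}}}text",
                         {"x": "100", "y": "95", "text-anchor": "middle",
                          "font-size": "12"})
    text.text = label
    return ET.tostring(svg, encoding="unicode")

def editable_labels(svg_source):
    """Return the contents of all <text> elements found in an SVG string."""
    root = ET.fromstring(svg_source)
    return [el.text for el in root.iter(f"{{{SVG_NS}}}text")]

svg_source = make_svg_with_label("Time (h)")
print(editable_labels(svg_source))
```

If a figure's labels were converted to paths on export, a check like this would return an empty list, and the text could no longer be edited as text in a vector graphics editor.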

Colors

Why use colors?

  1. To distinguish groups
  2. To represent data values

Color to distinguish groups

findlay_2018_colors_example.png

Findlay 2018

Color to represent quantitative data

tukiainen_2017_heatmap.png

Tukiainen 2017

How to choose colors

Avoid using many colors in a single graph

biddy_2018_lot_of_colors.png

Biddy 2018

How to choose colors

fluorescent_micrograph_colorblind_example.jpg

fluorescent_micrograph_colorblind_simulation.jpg

Use colorblind-friendly palettes

colorblind_friendly_palette.png
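
As an illustration, the Okabe-Ito palette is one widely used colorblind-friendly set of eight colors. Whether it is the palette shown in the figure above is an assumption. The sketch assigns one color per group and refuses to proceed when there are more groups than colors, which also enforces the earlier advice to avoid many colors in one graph. `assign_colors` is a hypothetical helper name.

```python
# Okabe-Ito palette: orange, sky blue, bluish green, yellow,
# blue, vermilion, reddish purple, black.
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]

def assign_colors(groups, palette=OKABE_ITO):
    """Map each distinct group (in order of first appearance) to one color."""
    unique = list(dict.fromkeys(groups))  # de-duplicate, keep order
    if len(unique) > len(palette):
        raise ValueError(
            f"{len(unique)} groups but only {len(palette)} colors; "
            "consider facets or direct labels instead of more colors"
        )
    return dict(zip(unique, palette))

# Hypothetical group labels.
print(assign_colors(["wild type", "mutant", "wild type", "rescue"]))
```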

Different types of data that we want to visualize

  • Amounts
  • X-Y Relationships
  • Distributions
  • Proportions

Common types of data visualizations

types_of_visualizations.png

Visualizing uncertainty

  • Standard Error
  • Confidence Bands

types_of_uncertainty.png

Standard deviation or standard error?

Standard deviation does not decrease with more measurements; standard error does, shrinking as 1/sqrt(n).
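
A small simulation makes the distinction concrete. The sketch draws samples of increasing size from a standard normal distribution (an assumption chosen for illustration) and reports the sample standard deviation and the standard error of the mean, computed as SD / sqrt(n).

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sd_and_sem(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)
    sem = sd / math.sqrt(len(values))
    return sd, sem

results = {}
for n in (10, 100, 1000, 10000):
    # Simulated measurements with true mean 0 and true SD 1.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    results[n] = sd_and_sem(sample)
    print(f"n = {n:5d}  SD = {results[n][0]:.3f}  SEM = {results[n][1]:.4f}")
```

The SD column hovers around the true spread of the data, while the SEM column shrinks as the sample grows. Error bars should therefore always state which of the two they show.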

Yes or No?

hawaii-income-bars-bad-1.png

Bars on a linear scale should begin at 0.

Wilke 2018
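
The effect of a truncated axis can be put in numbers. The sketch below compares the ratio of two values with the ratio of their drawn bar lengths when the axis starts above 0. The values and baseline are hypothetical and are not the income data in the figure.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of bar lengths for values a and b when bars start at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical values on a linear scale.
a, b = 60.0, 50.0

true_ratio = drawn_ratio(a, b)                      # bars start at 0
truncated_ratio = drawn_ratio(a, b, baseline=40.0)  # axis starts at 40

print(f"true ratio: {true_ratio:.2f}, drawn ratio: {truncated_ratio:.2f}")
```

With the axis starting at 40, a 20% difference is drawn as a bar twice as long, which is the distortion that starting bars at 0 avoids.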

Yes or No?

oceania-gdp-logbars-1.png

Bar areas are not proportional to value.

Principle of proportional ink

Wilke 2018

Yes or No?

oceania-gdp-dots-1.png

Values on a log scale are best shown as points.

Wilke 2018

Which is better – pie or bar?

RI-pop-pie-1.png

Which is better – pie or bar?

RI-pop-bars-1.png

Bar lengths are perceived more accurately than areas.

Wilke 2018

Yes or No?

mpg-cty-displ-solid-1.png

Overlapping points can be hidden.

Wilke 2018

Yes or No?

mpg-cty-displ-transp-1.png

Make points semi-transparent.

Wilke 2018
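
To see why transparency reveals overlap, consider how opacity accumulates. The sketch assumes standard alpha compositing of identically colored points, where n stacked points of opacity alpha give a combined opacity of 1 - (1 - alpha)^n. `stacked_opacity` is a hypothetical helper name.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n identical points drawn on top of each other."""
    return 1.0 - (1.0 - alpha) ** n_points

for n in (1, 2, 5, 10):
    print(f"{n:2d} overlapping points at alpha = 0.3 -> "
          f"opacity {stacked_opacity(0.3, n):.2f}")
```

Darker regions therefore indicate more overlapping points, but the scale saturates: beyond a certain number of points everything looks fully opaque.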

Yes or No?

mpg-cty-displ-jitter-1.png

Slightly jitter points along the direction of overlap.

Wilke 2018
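
Jittering amounts to adding a small random offset to each value along the axis where points overlap. A minimal sketch, with hypothetical data values and a hypothetical jitter width:

```python
import random

def jitter(values, width, rng):
    """Shift each value by a uniform random offset in [-width, +width]."""
    return [v + rng.uniform(-width, width) for v in values]

rng = random.Random(1)  # seeded generator so the figure is reproducible

# Hypothetical x values where several points share the same position.
displ = [2.0, 2.0, 2.0, 2.5, 2.5]
jittered = jitter(displ, width=0.05, rng=rng)

print(jittered)
```

The width should stay small relative to the spacing between distinct data values, and seeding the random generator keeps the figure identical between runs.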

Yes or No?

mpg-cty-displ-jitter-extreme-1.png

But too much jittering can be misleading.

Wilke 2018

Yes or No?

nycflights-points-1.png

Neither transparency nor jittering will help when data density is too high.

Wilke 2018

Yes or No?

nycflights-2d-bins-1.png

Binned 2D histograms are a good solution.

Wilke 2018
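
The idea behind a 2D histogram is to count points per rectangular bin and then color each bin by its count. A minimal sketch of the counting step, using hypothetical points and bin widths rather than the flights data in the figure:

```python
import math
from collections import Counter

def bin_2d(xs, ys, bin_width_x, bin_width_y):
    """Count points per rectangular bin; keys are (column, row) bin indices."""
    counts = Counter()
    for x, y in zip(xs, ys):
        key = (math.floor(x / bin_width_x), math.floor(y / bin_width_y))
        counts[key] += 1
    return counts

# Hypothetical points.
xs = [0.1, 0.4, 0.6, 1.2, 1.3, 1.9]
ys = [0.2, 0.3, 0.9, 1.1, 1.4, 0.1]

counts = bin_2d(xs, ys, bin_width_x=1.0, bin_width_y=1.0)
print(dict(counts))
```

Because each bin is drawn once regardless of how many points it holds, the plot stays readable at any data density.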

Yes or No?

nycflights-hex-bins-1.png

Hexagonal bins represent the data slightly more accurately than square bins.

Wilke 2018

Yes or No?

tech-stocks-bad-legend-1.png

Legend order does not match plot order.

Wilke 2018

Yes or No?

tech-stocks-good-legend-1.png

Legend order matches plot order.

Wilke 2018
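
For line plots, one simple way to make the legend follow the plot is to order legend entries by each series' final value, so they appear top to bottom in the same order as the line ends. The series names and values below are hypothetical, and `legend_order` is a hypothetical helper name.

```python
def legend_order(series):
    """Return series names sorted by their last value, highest first."""
    return sorted(series, key=lambda name: series[name][-1], reverse=True)

# Hypothetical indexed prices over three time points.
series = {
    "alpha": [100, 120, 150],
    "beta": [100, 180, 300],
    "gamma": [100, 90, 110],
}

print(legend_order(series))
```

The resulting order can then be passed to whatever plotting tool is in use as the order of the legend entries.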

Yes or No?

tech-stocks-good-no-legend-1.png

Prefer direct labeling over a legend.

Wilke 2018