May 19, 2015 by jerome on charts, d3, javascript, tips

You may not need d3

If you’re working on visualization on the web, the go-to choice is to use d3js (or a higher-level library). d3js is powerful and versatile. The best proof of that is that the lead author of d3, Mike Bostock, worked until recently at the New York Times, which many, including myself, consider the ultimate reference in terms of information graphics. At the NYT and elsewhere, d3js has powered breathtaking projects.

d3js was built as a successor to protovis. Protovis was a slightly higher-level library than d3js, more specialized in data graphics, and designed with the assumption that it could be used with little or no programming experience. And indeed this was true. When I started using protovis in 2009, my javascript skills were limited, and so I learned by deconstructing and recombining the examples. I kept on doing that when d3 came along – learning through examples. And the d3 examples, like those in protovis before, were short and standalone, demonstrating how easy it was to create something with a few lines of code.

d3js requires more technical knowledge than protovis did to get started, but not much. However, it is much more versatile – no longer constrained to charts and maps, it can handle a lot of the front-end functionalities of a web site. And so, it is possible to make visualizations by:

learning how to do stuff in d3js,
learning the essential javascript that you need to run d3js, as needed (i.e. variables, if-then statements, array methods…)
then, learning through doing, and obtaining a feeling for how SVG, HTML and the Document Object Model work.

This approach is actually very limiting.

You would create much more robust code by:

learning javascript,
learning SVG/HTML/DOM,
then learning and appreciating d3 logic.

That would give you the choice when to use a d3js method, when to use another library (past, present or future), or when to use native javascript.

The point of this article is to go through the main functions of d3js and see how they can be replicated without using a library. As of this writing, I am convinced that using d3js is the most sensible way to handle certain tasks (though this may change in the future). In almost any case I can think of, even when it’s not the most efficient way to do so, using the d3js approach is the most convenient and the easiest from the developer’s perspective. But I also believe it’s really important to know how to do all these things without d3js.

This will be a very long post. I’m trying to keep it as structured as possible so you can jump to the right section and keep it as a reference. This also assumes you have a certain familiarity with d3js and javascript – this is not an introductory post.

I’ve divided it into 4 parts:

Tasks that you really don’t need d3js for. And ok, you may want to use it regardless. But here’s how to do those things without it. (manipulating DOM elements)
interlude: a reflection on d3js data binding. Do you really need that?
Tasks which are significantly easier with d3js (working with external data, animation)
Tasks where as of now, d3js is clearly superior to anything else (scales, array refactoring, map projections, layouts).

Selecting and manipulating elements

At the core of d3js is the notion of selection. In d3js, you select one or several elements which become a selection object. You can then do a variety of things:

join them with data, which is also an essential tenet of the d3 philosophy,
create or delete elements, especially in relation to the data you may have joined with their parent,
manipulate these elements by changing their attributes and their style, including as a function of the data joined with each of them,
attach event listeners to those elements, so that when events happen (e.g. somebody clicks on a rectangle, or changes the value of a form) a function might be triggered,
animate and transform those elements.

Selecting elements and parsing the DOM

d3 selection objects vs Nodes, HTML NodeList vs HTML LiveCollections

When you select something with d3js, using methods such as d3.select() or d3.selectAll(), this returns a d3 selection object. The selection object is a subclass of the javascript object Array – check here for all the nitty-gritty. The gist of it is that d3 selection objects can then be manipulated by d3 methods, such as .attr() or .style() to assign attributes or styles.

By contrast, the DOM is made of Node objects. The root of the DOM, the Document Object Model, is a special Node called Document. Nodes are organized in a tree-like structure: Document has children, which may have children etc. and encompass everything in the page. The Node objects are really the building blocks of a web page. When an element is added, be it an HTML element like a <div> or an SVG element like a <rect>, a new Node is created as well. So d3js has to interact with Node objects as well. d3 selection objects are tightly connected to their corresponding Node objects.

However, with “vanilla” javascript, you can directly access and manipulate the Node objects.

d3.select / d3.selectAll vs document.querySelector / document.querySelectorAll

In d3js, you parse the document using d3.select (to select one object) or d3.selectAll. The argument of this method is a CSS3 selector, that is, a string which is very similar to what could be found in a CSS style sheet to assign specific styles to certain situations. For instance, “g.series rect.mark” will match with all the rectangles of the class “mark” which are descendants of the SVG g groups of the class series.

When d3js was introduced in 2011, native support for CSS selectors was still recent and could not be taken for granted in every browser – the long-standing way was to select elements by class, or by id, or by tag name (more on that in a minute). Things have changed however and it is now possible to use CSS3 selectors using document.querySelector (which will return just one node) or document.querySelectorAll (which will return an HTML NodeList).

An HTML NodeList is a special object which is kind of like an array of Node objects, only it has almost no array methods or properties. You can still access its members using brackets, and get its length, but that’s it.
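Since a NodeList lacks most array methods, a common workaround is to copy it into a real array first. Here is a minimal sketch of that idea; the helper name toArray is my own and not part of any library, and the browser usage in the comment assumes the page contains <rect> elements.

```javascript
// Copy any array-like object (a NodeList, an HTMLCollection, arguments...)
// into a real Array, so that map, filter, forEach and friends are available.
function toArray(arrayLike) {
  return Array.prototype.slice.call(arrayLike);
}

// In a browser, assuming the page contains <rect> elements:
// var xs = toArray(document.querySelectorAll("rect"))
//   .map(function (rect) { return rect.getAttribute("x"); });
```

The copy is a snapshot: it will not follow later changes to the document, even when the source collection was live.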

I wrote document.querySelectorAll because that is how you search the whole document, but the same method can be called on any Element to search only its descendants. Those two snippets of code are parallel:

var svg = d3.select("svg"); // svg is a d3 selection object
var g = svg.selectAll("g"); // g is a d3 selection object

var svg = document.querySelector("svg"); // svg is a Node
var g = svg.querySelectorAll("g"); // g is a NodeList

Getting elements by class name, ID, tag name, name attribute

d3js doesn’t have a special way to get all descendants of a selection of a certain class, of a certain ID, etc. The CSS3 selector syntax can indeed handle all those cases, so why have a separate way?

By contrast, javascript in older browsers didn’t have a querySelectorAll method, and so the only way to parse the document was to use more specific methods, like document.getElementsByClassName().

document.getElementsByClassName() retrieves all descendants of a certain class. document.getElementsByName() retrieves elements with a certain “name” attribute (think forms). document.getElementsByTagName() gets all descendants of a certain type (ie all <div>s, all <rect>s, etc.).

What’s interesting about that is that what is returned is not a static NodeList like above with querySelectorAll, but a live collection (an HTMLCollection in most cases). The difference is that if matching elements are created afterwards, they will still be included in the live collection.

var svg = d3.select("svg");
svg.selectAll("rect").data([1,2,3]).enter().append("rect");
var mySelection = svg.selectAll("rect"); // 3 elements
mySelection[0].length // 3
svg.append("rect");
mySelection[0].length // 3
mySelection = svg.selectAll("rect"); // re-selecting to update it
mySelection[0].length // 4

var svg = document.querySelector("svg");
var ns = "http://www.w3.org/2000/svg";
var i;
for (i = 0; i < 3; i++) {
  var rect = document.createElementNS(ns, "rect"); // we'll explain creating elements later
  svg.appendChild(rect);
}
var mySelection = svg.getElementsByTagName("rect"); // 3 elements
var rect = document.createElementNS(ns, "rect");
svg.appendChild(rect);
mySelection.length // 4 - no need to reselect to update

How about IDs? There is also getElementById (no s at element!), which only retrieves one element. After all, IDs are supposed to be unique! If no element matches, getElementById returns null.

Children, parents, siblings…

Truth be told, if you can use selectors from the root, you can access everything. But sometimes, it’s nice to be able to go from one node to its parents or its children or its siblings, and d3js doesn’t provide that. By contrast, the Node object has an interface that does just that – node.childNodes gets a nodeList of child nodes, node.parentNode gets the parent node, node.nextSibling and node.previousSibling get the next and previous siblings. Nice.

However, most often you will really be manipulating elements (more on that in a second) and not nodes. What’s the difference? all Elements are Nodes, but the reverse is not true. One common example of Node which is not an Element is text content.

To get an Element’s children, you can use the (wait for it) children property. The added benefit is that children only contains Elements, while childNodes also includes text and comment nodes. Both are live: they update as children are added or removed.

var svg = document.querySelector("svg");
var ns = "http://www.w3.org/2000/svg";
var i;
for (i = 0; i < 3; i++) {
  var myRect = document.createElementNS(ns, "rect"); // we'll explain creating elements later
  svg.appendChild(myRect);
}
// the variable myRect holds the last of the 3 <rect> elements that have been added
svg.childNodes; // a NodeList
myRect.parentNode; // the svg element
myRect.nextSibling; // null - myRect holds the last child of svg.
myRect.previousSibling; // the 2nd <rect> element
svg.firstChild; // the 1st <rect>, provided the <svg> holds no text nodes (svg.firstElementChild skips those)
svg.querySelector("rect"); // also the 1st <rect>.
svg.children; // a LiveCollection
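To make the Node versus Element distinction concrete, here is a small sketch that rebuilds the children property by hand from childNodes. The helper name elementChildren is hypothetical; the only DOM facts it relies on are that every node has a nodeType and that elements have nodeType 1.

```javascript
// Keep only the children that are Elements.
// nodeType 1 is ELEMENT_NODE; text nodes are 3 and comments are 8.
function elementChildren(node) {
  return Array.prototype.filter.call(node.childNodes, function (child) {
    return child.nodeType === 1;
  });
}

// In a browser, assuming the page contains an <svg> element:
// elementChildren(document.querySelector("svg")); // an Array of Elements
```

Unlike the children property, the result here is a plain array, so it has all the array methods but does not update when the document changes.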

Adding/reading attributes, styles, properties and events

Node, Element, EventTarget and other objects

In d3 101, right after you’ve created elements (to which we’ll come in a moment), you can start moving them around or giving them cool colors like “steelblue” by using the .attr and .style methods.

Wouldn’t that be cool if we could do the same to Node objects in vanilla javascript!

Well, we can. Technically, you can’t add style or attributes to Node objects proper, but to Element objects. The Element object inherits from the Node objects and is used to store these properties. There is also an HTMLElement and SVGElement which inherit from the Element object.

If you look at the Chrome console at an SVG element like a rect, you can see, in the properties tab, all the objects it inherits from: Object, EventTarget, Node, Element, SVGElement, SVGGraphicsElement, SVGGeometryElement, SVGRectElement, rect.

All have different roles. To simplify: Node relates to their relative place in the document hierarchy, EventTarget, to events, and Element and its children, to attributes, style and the like. The various SVG-prefixed objects all implement specific methods and properties. When we select a Node object as we’ve done above with svg.querySelector(“rect”) and the like, note that there’s not a Node object on one side, then an Element object somewhere else, a distinct SVGGeometryElement, and so on and so forth. What is retrieved is one single object that inherits all methods and properties of Nodes, Elements, EventTargets, and so on and so forth, and, as such, that behaves like a Node, like an Element, etc.

Setting and getting attributes

You can set attributes with the Element.setAttribute method.

var rect = document.querySelector("rect");
rect.setAttribute("x", 100);
rect.setAttribute("y", 100);
rect.setAttribute("width", 100);
rect.setAttribute("height", 100);

To be honest, I’m a big fan of the shorthand method in d3js,

var rect = d3.select("rect");
rect.attr({x: 100, y: 100, width: 100, height: 100});

Also, the Element.setAttribute method doesn’t return anything, which means it can’t be chained (which may or may not be a bad thing, though it’s definitely a change for d3js users). It’s not possible to set several attributes in one go either, although one could create a custom function, or, for the daring, extend the Element object for that.
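The custom function mentioned above could look something like this. It is only a sketch: setAttributes is a name I made up, and it does nothing more than call the native setAttribute once per key and return the element, so that calls can be chained.

```javascript
// Set several attributes in one go, d3-style, on a native Element.
function setAttributes(element, attributes) {
  Object.keys(attributes).forEach(function (name) {
    element.setAttribute(name, attributes[name]);
  });
  return element; // returning the element is what makes chaining possible
}

// In a browser, assuming the page contains a <rect>:
// setAttributes(document.querySelector("rect"), {x: 100, y: 100, width: 100, height: 100});
```

I would rather keep this as a standalone function than extend Element.prototype, since modifying built-in objects can clash with other scripts on the page.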

Likewise, the Element object has a getAttribute method :

rect.getAttribute("x"); // "100" - attribute values come back as strings

Classes, IDs and tag names

Classes, IDs and tag names are special properties of the Element objects. It’s extremely common to add or remove classes to elements in visualization: my favorite way to do that is to use the classed method in d3js.

d3.select("rect").classed("myRect", 1)

In vanilla javascript, you have the concept of classList.

document.querySelector("rect").classList; // ["myRect"]

ClassList has a number of cool methods. contains checks if this Element is of a certain class, add adds a class, remove removes a class, and toggle, well, toggles a class.

document.querySelector("rect").classList.contains("myRect"); // true
document.querySelector("rect").classList.remove("myRect");
document.querySelector("rect").classList.add("myRect");
document.querySelector("rect").classList.toggle("myRect");
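If you miss the getter/setter style of the d3js classed method, it can be approximated on top of classList. This is a rough sketch under that assumption; the function name classed is borrowed from d3js for familiarity, but the code is mine and only handles a single class name.

```javascript
// With two arguments: returns whether the element has the class.
// With three arguments: adds the class if value is truthy, removes it
// otherwise, and returns the element so that calls can be chained.
function classed(element, name, value) {
  if (arguments.length < 3) {
    return element.classList.contains(name);
  }
  if (value) {
    element.classList.add(name);
  } else {
    element.classList.remove(name);
  }
  return element;
}

// In a browser, assuming the page contains a <rect>:
// classed(document.querySelector("rect"), "myRect", true);
```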

How about IDs? With d3js, you’d have to treat them as any other property (rect.attr(“id”)). In vanilla javascript, however, you can access it directly via the id property of Element. Some HTML elements (form controls, for instance) expose a name property in the same way.
Finally, you can use the tagName to get the type of element you are looking at (though you cannot change it – you can try, it just won’t do anything).

document.querySelector("rect").id = "myRect"; // sets the id
document.querySelector("rect").name; // undefined;
document.querySelector("rect").tagName; // "rect"
document.querySelector("rect").tagName = "circle";
document.querySelector("rect").tagName; // "rect"

Text

Text is a pretty useful aspect of visualization! It is different from attributes or styles, which are set in the opening tag of an element. The text or content is what sits in between the opening and closing tags of that element. This is why, in d3js, text isn’t set using attr or style, but either by the html method for HTML elements like DIVs or Ps, or by the text method for SVG elements like <text> and <tspan>.

Those have equivalents in the DOM + javascript world.

HTML elements have the innerHTML and outerHTML properties. The difference between the two is that outerHTML includes the opening and closing tags. innerHTML and outerHTML both return HTML, complete with tags and syntax.

SVG elements, however, don’t have access to this property, so they have to rely on the Node property textContent. HTML elements also have access to it, by the way. textContent returns just the plain text content of what’s in the element. All three properties can be used to either get or set text.

Style

In d3js, setting styles on elements is very similar to setting attributes, only we use the .style method instead of the .attr one. It’s so similar that it’s a rather common mistake to pass as attribute what should be a style and vice-versa! Like with attributes, it is possible to pass an object with keys and values to set several style properties at once.

d3.selectAll("rect").style("fill", "red");
d3.selectAll("rect").style({stroke: "#222", opacity: .5});

In the world of DOM and vanilla javascript, style is a property of the HTMLElement / SVGElement objects. You can set style properties one at a time:

var rect = document.querySelector("rect");
rect.style.fill = "red";
rect.style.stroke = "#222";
rect.style.opacity = .5;

Technically, .style returns a CSSStyleDeclaration object. This object maintains a “live” link to what it describes. So:

var myStyle = rect.style;
rect.style.fill = "yellow";
myStyle.fill; // "yellow"

Finally, the window object has a getComputedStyle method that can get the computed styles of an element, ie how the element is actually going to get drawn. By contrast, the style property and the d3js style method only affect the inline styles of an element and are “blind” to styles of its parents.

myStyle = window.getComputedStyle(rect, null);
myStyle.fill; // "yellow"

Adding and removing events

In d3js, we have the very practical method “on” which lets users interact with elements and can trigger behavior, such as transformations, filtering, or, really, any arbitrary function. This is where creating visualizations with SVG really shines because any minute interaction with any part of a scene can be elegantly intercepted and dealt with. Since in d3js, elements can be tied with data, the “on” method takes that into account and passes the data element to the listener function. One of my favorite tricks when I’m developing with d3js and SVG is to add somewhere towards the end the line:

d3.selectAll("*").on("click", function(d) {console.log(d);})

Which, as you may have guessed, displays the data item tied to any SVG element the user could click on.

In the world of the DOM, the object to which event methods are attached is the EventTarget. Every Element is also an EventTarget (and an EventTarget could be other things that handle events too, like xhr requests).

To add an event listener to an element, use the addEventListener method like so.

document
  .querySelector("rect")
  .addEventListener("click", function() {
     console.log("you clicked the rectangle!");
   }, false);

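The first parameter is the type of event to listen to (just as in “on”), the second is the listener function proper. The third one, “use capture”, a Boolean, is optional. If set to true, the listener is called during the capture phase, as the event travels down from the document towards its target, that is, before listeners registered on the descendants of this element. It does not stop the event from propagating; for that, call event.stopPropagation() inside the listener.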

There is also a “removeEventListener” method that does the opposite, and needs the same elements: in other words, yes, you need to pass the same listener function to be able to stop listening to the element. There is no native way to remove all event listeners from an element, although there are workarounds.
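Because removeEventListener needs the exact same function (and the same capture flag) that was passed to addEventListener, one convenient pattern is to have the code that adds the listener hand back a function that removes it. A minimal sketch, where the helper name on is borrowed from d3js but the implementation is my own:

```javascript
// Attach a listener and return a function that detaches it again.
// The closure remembers the listener and the capture flag, so the
// caller does not have to keep track of them.
function on(target, type, listener, useCapture) {
  var capture = Boolean(useCapture);
  target.addEventListener(type, listener, capture);
  return function off() {
    target.removeEventListener(type, listener, capture);
  };
}

// In a browser, assuming the page contains a <rect>:
// var off = on(document.querySelector("rect"), "click", function () {
//   console.log("you clicked the rectangle!");
// });
// off(); // later on: stop listening
```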

Creating and removing elements

Selecting and modifying elements is great, but if you are creating a visualization, chances are that you want to create elements from scratch.

Let’s first talk about how this is done in the DOM/javascript, then we’ll better understand the data joins and d3 angle.

Node objects can exist outside of the hierarchy of the DOM. Actually, they must first be created, then be assigned to a place in the DOM.

Until a Node object is positioned in the DOM, it is not visible. However, it can receive attributes, styles, etc. Likewise, a Node object can be taken from the DOM, and still manipulated.

To create an HTML element, we can use the document.createElement() method:

var myDiv = document.createElement("div");

However, that won’t work for SVG elements – remember in an earlier example, we used the createElementNS method. This is because SVG elements have to be created in the SVG namespace. d3js old-timers may remember that in the first versions, we had to deal with namespaces when creating elements in d3js as well, but now this all happens under the hood.

Anyway, in vanilla javascript, this is how it’s done:

var svgns = "http://www.w3.org/2000/svg";
var myRect = document.createElementNS(svgns, "rect");

Be careful: document.createElement(“rect”) will not produce anything useful. It creates an element named rect in the HTML namespace, which the browser will not draw as an SVG rectangle.

Once the new Node objects are created, in order to be visible, they should be present in the DOM. Because the DOM is a tree, this means that they have to have a parent.

svg.appendChild(myRect);

Likewise, to remove a Node from the DOM means to sever that relationship with its parent, which is done through the removeChild method:

svg.removeChild(myRect);

Again, even after a Node has been removed, it can still be manipulated, and possibly re inserted at a later time.

Nodes don’t remove themselves, but you can write:

myRect.parentNode.removeChild(myRect);

In contrast, here is how things are done in d3js.

The append method will add one element to a parent.

d3.select("svg").append("rect");

The remove method will remove one entire selection object from the DOM.

d3.select("svg").selectAll("rect").remove(); // removes all rect elements which are descendants of the SVG element

But the most intriguing and the most characteristic way to create new elements in d3js is to use a data join.

d3.select("svg")
  .selectAll("rect")
  .data(d3.range(5))
  .enter()
  .append("rect");

The above snippet of code counts all the rect children of the svg element, and, if there are fewer than 5 – the number of items in d3.range(5), which is the [0,1,2,3,4] array – creates as many as needed to get to 5, and binds values to those elements – the contents of d3.range(5) in order. If there are already 5 rect elements, no new elements will be created, but the data assignment to the existing elements will still occur.
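The behavior described above can be imitated by hand, which may help demystify it. The sketch below is not how d3js is implemented; joinAndAppend and its createElement parameter are names I invented, and the factory is expected to both create and append a new element. The one d3js detail it borrows is that bound data lives in a __data__ property on each node.

```javascript
// existing: array-like of elements already in the container
// data: array of values to bind
// createElement: caller-supplied function that creates and appends one element
function joinAndAppend(existing, data, createElement) {
  // keep at most as many existing elements as there are data items
  var elements = Array.prototype.slice.call(existing, 0, data.length);
  // create the missing ones (what enter().append() would do)
  while (elements.length < data.length) {
    elements.push(createElement());
  }
  // bind each data item, in order, to its element
  elements.forEach(function (element, i) {
    element.__data__ = data[i];
  });
  return elements;
}

// In a browser, assuming the page contains an <svg>:
// var svg = document.querySelector("svg");
// joinAndAppend(svg.querySelectorAll("rect"), [0, 1, 2, 3, 4], function () {
//   return svg.appendChild(document.createElementNS("http://www.w3.org/2000/svg", "rect"));
// });
```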

Data joins, or the lack thereof

The select / selectAll / data / enter / append sequence can sound exotic to people who learn d3js, but to its practitioners, it is its cornerstone. Not only is it a quick way to create many elements (which, in vanilla javascript, takes at least 2 steps: creating them, and assigning them to the right parent), but it also associates them with a data element. That data element can then be accessed each time the element is being manipulated, notably when setting attributes or styles and handling events.

For instance,

d3.selectAll("rect")
  .attr("x", function(d) {return 20 * d;});

The above code uses the fact that each of the rectangles has a different number associated with it to dynamically set an attribute, here to position the rectangles horizontally.

d3.selectAll("rect")
  .on("click", function(d) {console.log(d);})

A trick I had mentioned above, but which illustrates this point: here by clicking on each rectangle, we use the data join to show the associated data element.

Having data readily available when manipulating elements in d3js is extremely convenient. After all, data visualization is but the encoding of data through visual attributes. How to perform this operation without the comfort of data joins?

Simply by going back to the dataset itself.

Consider this:

var data = [];
var i;
for (i = 0; i < 100; i++) {
  data.push({x: Math.random() * 300, y: Math.random() * 300}); // random data points
}

// d3 way
var d3svg = d3.select("body").append("svg");
d3svg.selectAll("circle").data(data).enter().append("circle")
  .attr({cx: function(d) {return d.x;}, cy: function(d) {return d.y}, r: 2})
  .style({fill: "#222", opacity: .5});

// vanilla js way
var svgns = "http://www.w3.org/2000/svg";
var svg = document.createElementNS(svgns, "svg");
document.querySelector("body").appendChild(svg);
for (i = 0; i < 100; i++) {
  var myCircle = document.createElementNS(svgns, "circle");
  myCircle.setAttribute("cx", data[i].x);
  myCircle.setAttribute("cy", data[i].y);
  myCircle.setAttribute("r", 2);
  myCircle.style.fill = "#222";
  myCircle.style.opacity = .5;
  svg.appendChild(myCircle);
}

See the Pen eNzaYg by Jerome Cukier (@jckr) on CodePen.

Both snippets are equivalent. Vanilla JS is also marginally faster, but d3 code is much more compact. In d3js, the process from dataset to visual is:

Joining dataset to container,
Creating as many children to container as needed, [repeat operation for as many levels of hierarchy as needed],
Using d3 selection objects to update the attributes and styles of underlying Node objects from the data items which have been joined to them.

In contrast, in vanilla Javascript, the process is:

Loop over the dataset,
create, position and style elements as they are read from the dataset.

For visuals with a hierarchy of elements, the dataset may also have a hierarchy and could be nested. In this case, there may be several nested loops. While the d3js code is much more compact, the vanilla approach is actually simpler conceptually. Interestingly, this is the same logic that is at play when creating visualization with Canvas or with frameworks like React.js. To simply loop over an existing, invariant dataset enables you to implement a stateless design and take advantage of immutability. You don’t have to worry about things such as what happens if your dataset changes or the status that your nodes are in before creating or updating them. By contrast most operations in d3js assume that you are constantly updating a scene on which you are keeping tabs. In order to create elements, you would first need to be mindful of existing elements, what data is currently associated with them, etc. So while the d3js approach is much more convenient and puts the data that you need at your fingertips, the vanilla JS approach is not without merits.

Loading files

The first word in data visualization is data, and data comes in files, or database queries. If you’re plotting anything with more than a few datapoints, chances are you are not storing them as a local variable in your javascript. d3js has a number of nifty functions for that purpose, such as d3.csv or d3.json, which let you load files asynchronously. The trick in working with files is that loading can take some time, so some operations can take place while we wait for the files to load, but some others really have to wait for the event that the file is loaded to start. I personally almost always use queue.js, also from Mike Bostock, as I typically have to load data from several files and a pure d3 approach would require nesting all those asynchronous file functions. But, for loading a simple csv file, d3js has a really simple syntax:

d3.csv("myfile.csv", function(error, csv) {
  // voila, the contents of the file are now stored in the csv variable as an array
})

For reference, using queue.js, this would look like

queue()
 .defer(d3.csv, "myFirstFile.csv")
 .defer(d3.csv, "mySecondFile.csv")
 .await(ready);

function ready(error, first, second) {
  // the contents of myFirstFile is stored as an array in the variable "first",
  // and the contents of mySecondFile are in the variable "second".
}

The way to do the equivalent in vanilla Javascript is to use XMLHttpRequest.

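function readFile() {
  // naive CSV parsing: the first line holds the field names,
  // and no value is expected to contain a comma
  var fileLines = this.responseText.trim().split("\n");
  var fields = fileLines[0].split(",");
  var data = fileLines.slice(1).map(function(d) {
    var item = {};
    d.split(",").forEach(function(v, i) {item[fields[i]] = v;});
    return item;
  });
  // the contents of the file are now stored in the data variable as an array
}

var request = new XMLHttpRequest();
request.onload = readFile; // called once the file has been received
request.open("get", "myFile.csv", true); // true: load asynchronously
request.send();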

The syntax of loading the file isn’t that cumbersome, and there are tons of nice things that can be done through XMLHttpRequest(), but let’s admit that d3js/queue.js functions make it much more comfortable to work with csv files.
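To give an idea of what the csv helpers save you from, here is the parsing step isolated as a standalone function. It is a deliberately naive sketch: parseCsv is my own name, it assumes the first line holds the field names, and it does not handle quoted values or commas inside values. All values stay strings, so numbers have to be converted by the caller.

```javascript
// Turn CSV text into an array of objects keyed by the header fields.
function parseCsv(text) {
  // split on Unix or Windows line endings and drop empty lines
  var lines = text.split(/\r?\n/).filter(function (line) {
    return line.trim() !== "";
  });
  if (lines.length === 0) {
    return [];
  }
  var fields = lines[0].split(",");
  return lines.slice(1).map(function (line) {
    var values = line.split(",");
    var row = {};
    fields.forEach(function (field, i) {
      row[field] = values[i];
    });
    return row;
  });
}

// With XMLHttpRequest, the load handler could then be:
// request.onload = function () { var data = parseCsv(this.responseText); };
```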

Animations

d3js transitions are one of my favorite parts of the library. I understand it’s also one of the things which couldn’t be done well in protovis and which caused that framework to break. It feels so natural: you define what you want to animate, all that needs to change, the time frame and the easing functions, and you’re good to go (see my previous post on animations and transitions). In native javascript, while you can have deep control of animations, it’s also, unsurprisingly, much more cumbersome. However, CSS3 provides an animation interface which is comparable in flexibility, expressiveness and ease of use to what d3js does. First let’s get a high-level view of how to do this entirely within JS. Then let’s get a sense of what CSS can do.

requestAnimationFrame and animation in JavaScript

JavaScript has timer functions, window.setTimeout and window.setInterval, which let you run some code after a certain delay or every so often, respectively. But this isn’t great for animation. Your computer draws to screen a fixed number of times per second. So if you try to redraw the same element several times in between two of those draws, it’s a waste of resources! What requestAnimationFrame does is tell your system to wait until the next time it is about to draw before executing a given function. Here’s how it will look in general.

function animate(duration) {
  var start = Date.now();
  var id = requestAnimationFrame(tick);
  function tick() {
    var time = Date.now();
    if (time - start < duration) {
       id = requestAnimationFrame(tick);
       draw((time - start) / duration); // proportion of the animation elapsed, between 0 and 1
    } else {
      cancelAnimationFrame(id);
    }
  }
  function draw(frame) {
    // do your thing, update attributes, etc.
  }
}

See the Pen PqzgLV by Jerome Cukier (@jckr) on CodePen.

OK so in the part I commented out, you will do the drawing proper. Are you out of the woods yet? Well, one great thing about d3js transitions is that they use easing functions, which transform a value between 0 and 1 into another value between 0 and 1 so that the speed of the animation isn’t necessarily uniform. In my example, (time – start) / duration represents the proportion of animation time that has already elapsed, so that proportion can be further transformed.
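As an illustration of that last step, here is what an easing function and an interpolator could look like in plain javascript. This is a sketch, not the d3js code: the names easeCubicInOut, interpolateNumber and valueAt are mine, and the formula is one common way to write a cubic in-out curve.

```javascript
// Slow start, fast middle, slow end. Maps [0, 1] onto [0, 1].
function easeCubicInOut(t) {
  return t < 0.5 ? 4 * t * t * t : 1 - Math.pow(-2 * t + 2, 3) / 2;
}

// Returns a function that maps a proportion in [0, 1] to a value between from and to.
function interpolateNumber(from, to) {
  return function (t) {
    return from + (to - from) * t;
  };
}

// Value to draw, given the time elapsed since the start of the animation.
function valueAt(elapsed, duration, ease, interpolate) {
  var t = Math.min(1, Math.max(0, elapsed / duration)); // clamp to [0, 1]
  return interpolate(ease(t));
}

// Inside a requestAnimationFrame callback, this could be used as:
// rect.setAttribute("x", valueAt(Date.now() - start, 1000, easeCubicInOut, interpolateNumber(0, 200)));
```

Clamping the proportion matters because the last frame usually fires slightly after the nominal end of the animation.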

So yay we can do everything in plain javascript, but that’s a lot of things to rewrite from scratch.

CSS3, animations and transitions

(This is not intended to be an exhaustive description of animations and transitions, a subject on which whole books have been written. Just to give those who are not familiar with it a small taste).

In CSS3, anything you can do with CSS, you can time and animate. But there are some caveats.

There are two similar concepts to handle appearance change in CSS: animations and transitions. What is the difference?

with animations, you describe a @keyframes rule, which is a sequence of states that happen at different points in time in your animation. In each of these states, any style property can be changed. The animation will smoothly transform your elements to go from one state to the next.
in transitions, you specify how changes to certain properties will be timed. For instance, you can say that whenever opacity changes, that change will be staged over a 1s period, as opposed to happening immediately.

Both approaches have their uses. CSS3 animations are great to create complex sequences. In d3js, that requires chaining transitions, which is the most complex aspect of managing them. By contrast, going from one segment of the animation to another is fairly easy to handle in CSS3. Animations, though, require the @keyframes rule to be specified ahead of time in a CSS declaration file. And yes, that can be done programmatically, but it’s cumbersome and not the intent. The point is: animations work better for complex, pre-designed sequences of events.

Transitions, in contrast, can be set as inline styles, and work fine when one style property is changed dynamically, a scenario which is more likely to happen in interactive visualizations.

Here’s how they work. Let’s start with animations.

See the Pen vOKKYN by Jerome Cukier (@jckr) on CodePen.

Note: as of this writing, animation-related CSS properties have to be vendor prefixed, i.e. you have to repeat writing these rules for the different browsers. Here’s a transition in action.

See the Pen xGOqOR by Jerome Cukier (@jckr) on CodePen.

For transition, you specify one style property to “listen” to, and say how changes to that property will be timed, using the transition: name of property + settings. (in the above example: transition: transform ease 2s means that whenever the “transform” style of that element changes, this will happen over a 2s period with an easing function).
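Since transitions can be set as inline styles, the same thing can be driven from javascript. The sketch below shows one way to do it; transitionStyle is a hypothetical helper of mine, and it relies only on the standard style.setProperty method. Depending on the browser, vendor-prefixed property names may be needed, which this sketch does not attempt.

```javascript
// Declare how changes to one style property are timed, then change it.
// The browser animates from the current value to the new one.
function transitionStyle(element, property, value, seconds, easing) {
  element.style.setProperty(
    "transition",
    property + " " + seconds + "s " + (easing || "ease")
  );
  element.style.setProperty(property, value);
  return element;
}

// In a browser, assuming the page contains a <rect> that is already drawn:
// transitionStyle(document.querySelector("rect"), "transform", "translateX(100px)", 2);
```

This is the programmatic counterpart of writing transition: transform ease 2s in a style sheet and changing the transform style later.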

One big caveat for both CSS animations and transitions is that they are limited to style properties. In HTML elements this is fine because everything that is related to their appearance is effectively handled by style: position, size, colors, etc. For SVG, however, color or opacity are styles, like in HTML, but positions, sizes and shapes are attributes, and can’t be directly controlled by CSS. There is a workaround for positions and sizes, which is to use the transform style property.

But wait: isn’t transform an SVG attribute as well? That’s right. And that’s where it can get really confusing. Many SVG elements are positioned through x and y properties (attributes). They can also have a transform property which is additive to that. For instance, if I have a <rect> which has an x property of 100 and a transform set at “translate(100)”, it will be positioned 200px right of its point of origin. But on top of that, SVG elements can have a transform style which affects pretty much the same things (position, scale, rotation…) but which has a slightly different syntax (“translate(100)”, for instance, wouldn’t work; you’d have to write “translateX(100px)”). What’s more, the transform set in the style doesn’t add to the one set in the properties, but overrides it. If we add a “transform: translateX(50px)” to our <rect>, it will be positioned at 150px, not 200px or 250px. Another potential headache is that some SVG elements cannot support transform styles.
While any of these properties can be accessed programmatically, managing their potential conflicts and overlaps can be difficult. In the transition example above, I have used the transform/translateX syntax.

That said, a lot of awesome stuff can be done in CSS only. For scripted animations, the animation in pure CSS is definitely more powerful and flexible than the d3js equivalent. However, when dealing with dynamically changing data, while you can definitely handle most things through CSS transitions, you’ll appreciate the comfort of d3js-style transitions.

See the Pen GJqrLO by Jerome Cukier (@jckr) on CodePen.

Now a common transformation handled by d3js transitions is to change the shape of paths. This is impossible through CSS animations/transitions, because the shape of the path – the “d” – is definitely not a style property. And sure, we can use a purely programmatic approach with requestAnimationFrame, but is there a higher-level way?
It turns out there actually is – the animation element of SVG, or SMIL. Through SMIL, everything in SVG can be animated through a high-level interface. This includes moving an object along a path, which I wouldn’t know how to do off the top of my head in d3js. Here is an extensive explanation of how this works and what can be done.
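For reference, the purely programmatic requestAnimationFrame approach mentioned above could look roughly like this. It is a sketch under a strong assumption: both path strings have the same structure (same commands, same number of coordinates), so that only the numbers differ. The function names interpolatePath and animatePath are made up for this example, and compact arc flags are not handled.

```javascript
// Matches the numbers embedded in a path string such as "M0,0L10,10".
const NUMBER = /-?\d*\.?\d+(?:[eE][-+]?\d+)?/g;

// Returns a function of t (0 to 1) that gives the intermediate "d" string.
function interpolatePath(from, to) {
  const start = (from.match(NUMBER) || []).map(Number);
  const end = (to.match(NUMBER) || []).map(Number);
  if (start.length !== end.length) {
    throw new Error("paths must contain the same number of coordinates");
  }
  return function (t) {
    let i = 0;
    // Reuse the target string as a template and swap in interpolated numbers.
    return to.replace(NUMBER, function () {
      const value = start[i] + (end[i] - start[i]) * t;
      i += 1;
      return String(Math.round(value * 1000) / 1000);
    });
  };
}

// Browser only: update the "d" attribute on every frame for `duration` ms.
function animatePath(element, from, to, duration) {
  const interpolate = interpolatePath(from, to);
  let startTime = null;
  function frame(now) {
    if (startTime === null) startTime = now;
    const t = Math.min(1, (now - startTime) / duration);
    element.setAttribute("d", interpolate(t)); // linear timing, no easing
    if (t < 1) requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);
}
```

Even this small version shows why a higher-level interface is attractive: timing, easing and mismatched path structures all have to be handled by hand.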

Data processing, scales, maps and layouts

For the end of the article, let’s talk about everything that could technically be done without d3js, but in a much, much less convenient way. Therefore, I won’t be discussing alternatives with vanilla Javascript, which would be very work-intensive and not necessarily inventive.

Array functions and data processing

d3js comes with a number of array functions. Some are here for convenience, such as d3.min or d3.max which can easily be replaced by using the native reduce method of arrays. When comparing only two variables, d3.max([a, b]) is not much more convenient than Math.max(a,b) or a > b ? a : b.
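As a quick illustration of the reduce approach, here is a minimal sketch. The names max and min and the optional accessor argument are choices made for this example; unlike the d3js functions, this version makes no attempt to skip undefined or NaN values.

```javascript
// Largest value of an array, optionally read through an accessor function.
function max(values, accessor = (d) => d) {
  return values.reduce((best, d) => {
    const v = accessor(d);
    return best === undefined || v > best ? v : best;
  }, undefined);
}

// Smallest value, same idea.
function min(values, accessor = (d) => d) {
  return values.reduce((best, d) => {
    const v = accessor(d);
    return best === undefined || v < best ? v : best;
  }, undefined);
}

// Usage: max([3, 9, 4]) or max(rows, (d) => d.price)
```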

Likewise, d3js has many statistical functions, such as d3.quantile, which save you the trouble of implementing them yourself if you need them. There are other libraries that do that, but they’re here and it’s really not useful to recode that from scratch.

d3js comes with shims for maps and sets, which will be supported by ES6. By now, there are transpilers which can let you use ES6 data structures. But it’s nice to have them.

In my experience, d3js’s most useful tool in terms of data processing is the d3.nest function, which can transform an array of objects into a nested object. (I wrote this article about them). Underscore has something similar. While you can definitely get a dataset into any shape and size by parsing an array and performing any grouping or operations manually, this is not only very tedious but also extremely error-prone.
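To give a sense of what the manual route involves, here is a sketch of a single level of grouping. The function name nestBy and the sample data are made up for this example; the output is shaped as an array of { key, values } objects, and nothing here handles multiple levels, sorting or rollups, which is where hand-written code tends to go wrong.

```javascript
// Group rows by the key returned by keyOf, preserving first-seen order.
function nestBy(rows, keyOf) {
  const groups = rows.reduce((acc, row) => {
    const key = String(keyOf(row));
    if (!acc.has(key)) acc.set(key, []);
    acc.get(key).push(row);
    return acc;
  }, new Map());
  return Array.from(groups, ([key, values]) => ({ key, values }));
}

// Hypothetical sample data.
const sales = [
  { shop: "north", amount: 10 },
  { shop: "south", amount: 4 },
  { shop: "north", amount: 7 },
];
const byShop = nestBy(sales, (d) => d.shop);
```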

Scales

Scales are one superb feature of d3js. They are simple to understand, yet very versatile and convenient. Oftentimes, d3 scales, especially the linear ones, are a replacement for linear algebra.

d3.scale.linear().range([100, 500]).domain([0, 24])(14);
((14 - 0) / (24 - 0)) * (500 - 100) + 100; // those two are equivalent (333.33...)

but changing the domain or the range of a scale is much safer than rewriting ad hoc formulas. Add to this scale goodness such as the ticks() method or nice() to round the domain, and you get something really powerful.
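For comparison, the formula above can be wrapped in a small function. This is only a sketch of the arithmetic: the name linearScale is made up for this example, and it offers none of the extras (ticks, nice, clamping, multi-segment domains) that make the d3js scales worth keeping.

```javascript
// Map a value from [d0, d1] to [r0, r1] by linear interpolation.
function linearScale(domain, range) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  const scale = (x) => r0 + ((x - d0) / (d1 - d0)) * (r1 - r0);
  // The reverse mapping, from a position back to a data value.
  scale.invert = (y) => d0 + ((y - r0) / (r1 - r0)) * (d1 - d0);
  return scale;
}

const hourToPixels = linearScale([0, 24], [100, 500]);
// hourToPixels(14) gives the same 333.33... as the formula above.
```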

So, of course it is possible (and straightforward, even) to replace the scales, but that would be missing out on one of the best features of d3js.

Maps

d3js comes with a full arsenal of functions and methods to handle geographic projections, i.e. the transformation of longitude/latitude coordinates into x,y positions on screen. Those come in two main groups. There are projections proper, which turn an array of two values (longitude, latitude) into an array of two values (x, y). There are also path functions, which are used to trace polygons, such as countries, from specially formatted geographic files (GeoJSON, TopoJSON).

The Mercator projection may be straightforward to implement, but others are much less so. The degree of comfort that d3js provides when visualizing geographical data is really impressive.
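To back up the “straightforward” part, here is a sketch of the spherical Mercator formula. The function name and the scale and translate parameters (with arbitrary defaults, not the d3js ones) are assumptions of this example; there is no clipping, so latitudes close to the poles give very large values.

```javascript
// Project [longitude, latitude] in degrees to [x, y] in pixels.
function mercator([lon, lat], scale = 100, translate = [0, 0]) {
  const lambda = (lon * Math.PI) / 180;
  const phi = (lat * Math.PI) / 180;
  const x = translate[0] + scale * lambda;
  // Screen y grows downwards, so northern latitudes get smaller y values.
  const y = translate[1] - scale * Math.log(Math.tan(Math.PI / 4 + phi / 2));
  return [x, y];
}

// Usage: mercator([2.35, 48.85], 150, [480, 250])
```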

Layouts

d3js layouts, that is, special visual arrangements of data, were in my opinion one of the key drivers of the adoption of protovis (where they originated) and then d3js. Through layouts, it became simple to create, with a few lines of code, complex constructions like treemaps or force-directed networks. Some layouts, like the pie chart or the dendrogram, are here for convenience and could be emulated. Others, and most of all the force layout, are remarkably implemented, efficient and versatile. While they go by different names in d3js, geometry functions such as Voronoi tessellation or convex hulls are functionally similar, and there is little incentive to reproduce what they do in plain javascript.
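As an example of a layout that could be emulated, the core of a pie layout is just a running sum of angles. This is a sketch: the name pieAngles is made up for this example, it assumes positive values, and it does no sorting or padding.

```javascript
// Turn values into slices described by a start and an end angle, in radians.
function pieAngles(values) {
  const total = values.reduce((sum, v) => sum + v, 0);
  let angle = 0;
  return values.map((value) => {
    const startAngle = angle;
    angle += (value / total) * 2 * Math.PI;
    return { value, startAngle, endAngle: angle };
  });
}

// Usage: pieAngles([1, 1, 2]) describes a quarter, a quarter and a half.
```

Drawing the slices (for instance as SVG arcs) is a separate step; the layout part only computes where each slice starts and ends.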

Should I stop using d3?

d3js is definitely the most advanced javascript visualization library. The point of this article is not to get you to stop using it, but rather to get you to think critically about your code. Even with the best hammer, not everything is a nail.

To parse the DOM, manipulate classes and listen to events, you probably don’t need a library. The context of your code may make it more convenient to use d3 or jQuery or something else, but it’s useful to consider alternatives.
The concept of the data join unlocks a lot of possibilities in d3js. A good understanding of data joins lets you implement your visualization much faster, using more concise code and replicable logic. It also makes troubleshooting easier. Data joins are especially useful if you have a dataset which is structured like your visualization should be, or if you plan to have interaction with your visualization that requires quick access to the underlying data. However, data joins are not necessary in d3js, or in visualization in general. In many cases, it’s actually perfectly sensible to parse a dataset and create a visual representation in one fell swoop, without attaching the data to its representation or overly worrying about updating.

Assuming you have d3js loaded, nothing prevents you from creating elements using d3js append methods instead of vanilla javascript, or from listening to events using addEventListener rather than the d3js on method. It’s totally ok to mix and match.

Like data joins, transitions are a very powerful component of d3js, and, once you’re comfortable with them, they are very expressive. There are other animation frameworks available though, which can be better adapted to the task at hand.
Scales, maps, layouts and geometries are extremely helpful features, however, and I can think of no good reason to reimplement them.

Credit where it’s due

To write this I drew inspiration from many articles and I will try to list them all here.
The spark that led me to write was Lea Verou’s article jQuery considered harmful, as well as articles she cites (you might not need jQuery, you don’t need jQuery, Do you really need jQuery?, 10 tips for writing JavaScript without jQuery).

Most of the information I used especially in the beginning of the article comes more or less directly from MDN documentation, and direct experimentation. For CSS and animation, I found the articles of Chris Coyier (such as this one or this one) and Sara Soueidan (here) on CSS Tricks to be extremely helpful. Those are definitely among the first resources to check out to go deeper on the subject. Sara was also the inspiration behind my previous post, so thanks to her again!

Finally, I’ve read Replacing jQuery with d3 with great interest (like Fast interactive prototyping with d3js and sketch about a year ago). It may seem that what I write goes in the opposite direction, but we’re really talking about the same thing: that there are many ways to power front ends and that it’s important to maintain awareness of alternative methods.

February 7, 2015 by jerome on charts, dashboards, data visualization, tips

Charts, assemble!

From the past posts, you would have gathered that dashboards are tools to solve specific problems. They are also formed from individual charts and data elements.

Selecting information

That dashboards are so specific is great, because the problem that they are designed to solve will help us choose the information that we need and also prioritize it – two essential tasks in dashboard creation. Again, we don’t want to shove in every data point we have.

Another great tool to help us do those two tasks is user research. As designers, we may think we chose the right metrics, but they have to make sense to real users and resonate with them. The bias that we may have is that we would favor data which is easy to obtain or that makes sense to us, over data which is more elaborate, more sophisticated or more expensive to collect or compute, even if the latter makes more sense to the user.

Here’s an illustration of that.

When I was working at Facebook on this product, Audience Insights, we designed this page to help marketers understand how a group of users they could be interested in used Facebook. (The link / screenshot showcases fans of the Golden State Warriors). One of the main ways we classified users at Facebook, for internal purposes, is by counting how many days of the last four weeks they have been on Facebook. It’s a metric called L28 and one of the high-level things Facebook knows about everyone. So, we integrated it in the first version of this page. But, even though it’s not a concept unique to Facebook, it wasn’t that useful to our users, and it was taking space from a more relevant indicator.

Instead, we have included indicators which are more relevant to the task at hand (i.e. getting a sense of who those users are through the prism of their activity). For instance we can see that very few Warriors fans only use Facebook on their computer, compared to the general population of US Facebook users. They tend to skew more towards Android and mobile web (going to www.facebook.com from their phone, versus using an app). They tend to be more active in terms of likes, comments and shares.

Information hierarchy

Once information is chosen and you get a sense of what is more important than the rest, it’s time to represent that visually.

Here are some of the choices you can make.

Show some metrics on top or bigger than others.

That’s probably the first thing that comes to mind when thinking hierarchy and prioritization. And it needs to be done! Typically, you should get one to three variables that really represent the most important thing you want your users to read or remember. If you come up with more than 3, you should refine your question/task and possibly split it in two.

The rest of the variables will support these very high level metrics. Again, in a typical situation, you could come up with up to three levels of data (with more than three being a good indication to rethink your scope). Some metrics can support the high-level metrics (i.e. show them with a different angle, or explain them) and some metrics could in turn support them.

Present some metrics together.

Stephen Few argues that dashboards should fit on one page or one screen because their virtue is to present information together. With the flexibility offered by the modern web, and the size constraints of mobile, this is a requirement that shouldn’t be absolute. But it’s relevant to remember that some variables add value when seen along other variables. With that in mind, you can have part of your dashboard as a fixed element (always visible on screen) while the rest can scroll away, for instance.

Push some metrics to secondary cards (such as mouseovers, pop-ups or drill-down views)

Hierarchizing information is not just about promoting important information. It’s also about demoting information which, while useful in its own right, doesn’t deserve to steal the show from the higher-level metric. The great thing about interactive dashboards is that there are many mechanisms for that. Some information can be kept as “details on demand” and only shown when needed.

Figure out what form to give to the data

So you have data. It probably changes over time, too (and you have that history as well!). And a sense of how important it is.

You can represent it as a static number (and, further, adjust the precision of that number) or as a time series (i.e. line graph, area graph, bar graph etc.), or both.

The key question to answer is whether the history and how the metric moved over time is relevant and important, versus the latest figures. If you think that the history is always important or that it doesn’t hurt to have it for context anyway, consider that it’s yet another visual element to digest, another thing that can be misinterpreted, and that unless its importance is clearly demonstrated, you’d rather not include it. Yes – even as a tiny sparkline.

Here is another example from my work at Facebook of a page where proper hierarchy has been applied.

Page Insights, to use a parallel with a better known product, is like Google Analytics, only for Facebook Pages instead of web sites. Unsurprisingly, the metric we put to the top left is the Page Likes, which is the number of people who like a page. The whole point of the system is to let people understand what affects that number and how to grow it. Two other high-level metrics are shown on the same row in the two cards on the right: the Post Reach for the week (number of people who have seen content from this page this week, whether they like the Page or not) and Engagement (number of people who acted on the content – actions could be liking, commenting, sharing, clicking, etc.)

The number of new Page Likes of the past week, which is represented as both a line chart and a number in the left card, is an example of a level two metric. It supports the top metric – total likes. The number of new Page Likes of the previous week, which is represented as a line chart only, is a level three metric. It’s here just as a comparison to the number of the current week – here, it helps us figure out that last week was a better week.

Connecting the dots

Ultimately, a dashboard is more than a collection of charts. It’s an ensemble: charts and data are meant to be consumed as a whole, with an order and a structure. Charts that belong together should be seen together. The information gained this way will be much more useful than what you would get from looking at them in sequence.

Linking, for instance, is the concept of highlighting an element in a given chart with repercussions on other charts, depending on the element highlighted. A common use case is to look at data for a given point in time, and see the value for that point in time highlighted in related charts. Here is an example:

In this specific case, the fact that both charts share the same x-axis makes comparing the shape of both charts easier even without linking.

Each variable doesn’t have to be on its own chart. Your variables can have an implicit relation between one another. Bringing them together might make that relation explicit. Here are some interesting relationships between variables, or properties of variables, that can be made apparent through the right chart choice.

One variable could be always greater than another one, because the second is a subset of the first. Here are some examples:

The number of visits on a website last week will always be greater than or equal to the number of unique visitors that week, which will always be greater than the number of visitors on the last day.
The number of visitors will always be greater than the number of first-time visitors.
The cumulative number of orders over a period of time will always be greater than the number of daily orders over that same period.
The time that users spend with a website in an active window of their browser will always be greater than the time they spend actively interacting with the site.

What’s interesting here is that these relations are not just true because of experience, they are true by definition. These are also metrics that are expressed in the same units and, in most cases, with the same order of magnitude, so they can be displayed on the same chart. When applicable, showing them together can show whether they do, indeed, move together.

One variable could be the sum of two other, less important variables.

In the example below we go even one step further and we show that one variable is the sum of two variables minus a fourth one.
Here, we look at the net likes of a Facebook Page, that is, the difference between the number of people who like a page on a given day and the day before.
Two factors can make more people like a page: paid likes (a user sees an ad, is interested, and from it, likes the page) or organic likes (a user visits a page, or somehow sees content from that page, and likes it, without advertisement involved). Finally, people may also decide to stop liking the page (“unlikes”).
Here, net likes = organic likes + paid likes – unlikes. The reason we decomposed likes into organic and paid is that we wanted to show that ads can amplify the effect of good content. So, visually, we chose to represent that as a layer on top of the rest. (Important remark: your dashboard doesn’t have to be neutral. If it can show that your product, company, team etc. is delivering, and you have an occasion to demonstrate it, don’t hesitate a moment.) By showing the unlikes as a negative number, as opposed to a positive variable going up, possibly above the likes (which would be unpredictable), we can keep the visual legible and uncluttered. A user can do the visual combination of all these variables. This chart, by the way, shows the typical dynamic of a Page: new content will generate peaks of new users, but will also cause some users to stop liking the page.
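The relation between these variables can be written down in a few lines. This is only an illustration: the figures are invented, and the field names (organic, paid, unlikes) and the function toChartRow are assumptions of this example, not the actual Page Insights data model.

```javascript
// Hypothetical daily figures, for illustration only.
const days = [
  { day: "Mon", organic: 120, paid: 80, unlikes: 30 },
  { day: "Tue", organic: 90, paid: 0, unlikes: 45 },
];

// Prepare one row of chart data: likes stack upwards, unlikes go below zero.
function toChartRow(d) {
  return {
    day: d.day,
    organic: d.organic,
    paid: d.paid, // drawn as a layer on top of organic
    unlikes: -d.unlikes, // plotted as a negative value
    net: d.organic + d.paid - d.unlikes,
  };
}

const rows = days.map(toChartRow);
```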

One variable could be always growing. Or always positive.

When that is the case, this can be used to make choices about how to represent the chart. If a variable is always growing by nature (e.g. cumulative revenue), you may want to consider representing a growth rate rather than the raw numbers. A reason to consider that is that your axis scale will have to change over time (e.g. if you plot a product that sells for around $1m per day, having an axis that goes from 0 to $10m would be enough for a week, but not for a month, let alone for a year, whereas with a growth rate you can represent a long period of time consistently). And if a variable is always positive (e.g. a stock price), your y axis can start at 0, or even at an arbitrary positive value, as opposed to allocating space for negative values.
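Going from a cumulative series to a growth rate is a small computation. Here is a sketch, assuming a series of strictly positive cumulative values at regular intervals; the function name growthRates is made up for this example.

```javascript
// Period-over-period growth rate: (current - previous) / previous.
// Returns one value fewer than the input, since the first point has no predecessor.
function growthRates(cumulative) {
  return cumulative
    .slice(1)
    .map((value, i) => (value - cumulative[i]) / cumulative[i]);
}

// Usage: growthRates([100, 110, 121]) gives roughly 10% for each period.
```

The cumulative values keep climbing, while the rates stay in a narrow band, which is what lets a single axis work over a long period.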

Conversely, if a variable doesn’t change over time, it doesn’t mean that it’s not interesting to plot. That absence of change could be a sign of health of the system (which is the kind of task that dashboards can be useful for). So the absence of change doesn’t mean that there’s an absence of message.

February 7, 2015 by jerome on charts, dashboards, data visualization, tips

Dashboards as products

In the past few articles I’ve exposed what dashboards are not:

an exercise in visual design,
an exercise in data visualization technique.

Another way to put this is that “let’s do this just because we can” is a poor mantra when it comes to designing dashboards, or visualizations in the broader sense by the way.

Do it for the users

Now saying that dashboards should be products is a bit tautological. Products, in product design, refer to the result of a holistic process that solves problems of users – a process that includes research, conception, exploration, implementation and testing.

Most importantly, it’s about putting the needs of your users first. And your users first. Interestingly, treating your dashboard as a product means that the dashboard – your product – doesn’t come first.

Creating an awesome dashboard is a paradox. Googling for that phrase yields results such as: 20+ Awesome Dashboard Designs That Will Inspire You, 25 Innovative Dashboard Concepts and Designs, 24 beautifully-designed web dashboards that data geeks or 25 Visually Stunning App Dashboard Design Concepts. This is NOT dashboard product design (though it’s a good source of inspiration for visual design of individual charts).

Eventually, no one cares for your dashboard. When designing a dashboard, it’s nice to think that somebody out there will now spend one hour every day looking at all this information nicely collected and beautifully arranged, but who would want to do that? Who would want to add to their already busy day an extra task, just to look at information the way you decided to organize it? This point of view is a delusion. We must not design as if it were true.

Instead, let’s focus on the task at hand. What is something that your users would try to accomplish that could be supported by data and insights?

What is the task at hand?

If you start to think “show something at the weekly meeting” or “make a high-level dashboard”, I invite you to go deeper. Show what? A dashboard for what? Not for its own sake.

Trickier – how about: “to showcase the data that we have”? That is still not good enough. You shouldn’t start from your data to create your dashboard, for several reasons. Doing so would limit you to the data that you have or which is readily available to you. But maybe this data, in its raw form, is not going to be relevant or useful to your users. Conversely, you would be tempted to include all the data that you have, but each additional piece of information that you bring to your dashboard would make it harder to digest and eventually be detrimental to the process. Most importantly, if you don’t have an idea of what the user would want to accomplish with your data, you cannot prioritize and organize it, which is the whole point of dashboard design.

Finally – “to discover insights” is not a task either. Dashboards are a curated way to present data for a certain purpose. They are not unspecified, multi-purpose analytical exploration tools. In other words: dashboards will answer a specific, already formulated question. And they will answer in the best possible way, if they are designed as such. For exploration, ad-hoc analysis is more efficient, and is probably better left to analysts or data scientists than to end users.

Here are some examples of tasks:

Check that things are going ok – that there is no preventable disaster going on somewhere. For instance: website is up – visits follow a predictable pattern.
Specifically, check that a process has completed in an expected way. For instance: all payments have been cleared.
If something goes wrong, troubleshoot it – find the likely cause. For instance: sales were down for this shop… because we ran out of an important product. Order more to fix the problem, make sure to stock accordingly next time.
Support a tactical decision. For instance: here are the sales of the new product, here are the costs. Should we keep on selling it or stop?
Decide where to allocate resources. For instance: we launched three variations of a product, one is greatly outperforming the other two, let’s run an ad campaign to promote the winner.
Try to better understand a complex system. For instance: user flow between pages can show where users are dropping out or where efficiency gains lie.

This list is by no means exhaustive. But it’s much more useful to start from the problem at hand than to just try to create a visual repository for data.

Next, we’ll see how to implement these in the last article: charts assemble!

February 7, 2015 by jerome on charts, dashboards, data visualization, tips

Dashboards versus data visualization

Dashboards are extreme data visualizations

In the recent Information is Beautiful 2014 awards, I found it interesting that there are separate infographics and data visualization categories. My interpretation is that the entries in the infographics section are static and illustrated, while those in the data visualization section are generated and data-driven. However, all the featured data visualization projects are about a one-off dataset. So the aesthetic choices of the visualization depend on the characteristics of this particular dataset. By contrast, the dashboards I have worked with are about a live, real-time datastream. They have to look good (or at least – to function) whatever the shape and size of the data that they show. The Google quote and news chart that we saw earlier must work for super volatile shares, for more stable ones, for indices, currencies, etc. So, if the distinction between infographics and data visualization makes sense to you, imagine that dashboards sit further in that continuum than data visualization. Not only are dashboards generated from data, like data visualizations, but they are also real-time and should function with datasets of many shapes and sizes.

But dashboard problems are not data visualization problems

Data visualization provides superior tools and techniques to present or analyze data. With libraries and languages dedicated to making visualizations, there is little that can’t be done. In many successful visualizations, the author will create an entirely new form, or at least control the form very finely to match their data and their angle. Even without inventing a new form, there are many which have been created for a specific use, and which are relatively easy to make on the web (as opposed to say, in Excel): treemaps, force-directed graphs and other node-link diagrams, chord diagrams, trees, bubble charts and the like. And even good old geographic maps.

In most cases, it is not a good idea to be too clever and have a more advanced form.

Up until mid November 2014, Google Analytics allowed users to view their data using motion charts.

This was really an example of having a hammer and considering all problems as nails. Fortunately, this function disappeared from the latest redesign.

Likewise, on Twitter’s followers dashboard, the treemap might be a bit over the top:

and possibly confusing and not immediately legible to some users. On the other hand, it is economical in terms of space and would probably work in almost every case, which are two things that dashboards should be good at. So while I wouldn’t have used it myself, I can understand why this decision has been made.

Dashboards are not an exercise in visual design either

A dashboard such as this:

(for which I can’t find the source. I found it on pinterest and was able to trace it to this post but not prior) is well designed visually, it makes proper use of space, colors and type, its charts are simple.

But what good is it? What do I learn, what can I take away from it, what actions can I perform?

Most of the dashboard examples I find on sites like Dribbble or Behance (see my Pinterest board) fall into that category: inspiring visual design, probably not real data, no flow, no obvious use.

Dashboards are problems of their own

What makes a dashboard, or any other information-based design successful, is neither the design execution nor the clever infovis technique. Dashboards, eventually, are meant to be useful and to solve a specific problem.

How so? We’ll see in the next article: dashboards as products.

February 7, 2015 by jerome on charts, dashboards, data visualization, tips

Charts in the age of the web

In 2008, when I was working at OECD, my job description was that of an editor. That implied I was mostly working on books. I was designing charts, but they were seen as components of books. And this was typical of the era.

So we would create charts like this one:

And it was awesome! (kind of). I mean, we respected all the rules. Look at that nicely labelled y-axis! And all the categories are on the x-axis! The bars are ordered in size, it’s easy to see which has the biggest or smallest value! And with those awesome gridlines, we can look up values – at least get an order of magnitude.

What we really did though was apply styling to an Excel chart (literally).

Print charts vs interactive charts

Origin of rules for print charts

Rules that govern traditional charts (which are many: ask Tufte, Few) make a certain number of assumptions which are interesting to question today.

One is that charts should be designed so that values can be easily looked up (even approximately) from the chart. This is why having labeled axes and gridlines is so useful. This is also why ordering bar charts in value order is nice. With that in mind, it also makes sense that charts like bar charts or area charts, which compare surfaces, be drawn on axes that start at 0.

The other assumption is that a chart will represent the entirety of a dataset that can be shown at a time. We have to come up with ways to make sure that every data point can be represented and remains legible. The chart author has to decide, once and for all, which is the dataset that will be represented, knowing that there will be “no backsies”.

Along the same lines, the author must decide the form of her chart. If she wants to compare categories, she may go for a bar chart. If she wants to show an evolution over time, for a line chart. And if she wants the user to have exact values, she will choose a table.

And so, when anything other than a table is chosen, we typically don’t show values with all the data points, because adding data labels would burden the chart and make its overall shape harder to make out.

In this framework, it makes sense to think in terms of data-ink (the cornerstone of Tuftean concepts): make sure that out of all the ink needed to print the chart (you can tell it’s a print concept already…), as much as possible goes to encoding the data, versus anything else.

How about now

However, not a single one of these reasons is valid today in the world of web or mobile charts. Data-ink only made sense on paper.

Web charts have many mechanisms to let the user get extra information on a given data point. That can be information that updates on mouseover, callouts and tooltips… This might be less true of mobile in general, where the distinction between hovering and clicking is less clear. But it is definitely possible to obtain more than what is originally displayed. If I want to have an exact value, I shouldn’t have to simply deduce that from the shape of the chart. There can be mechanisms that can deliver that to me on demand.

An example: Google Finance Quote & News

The Google Finance Quote and News chart is a very representative example of a web-native chart. Around since 2006, it provides the price of a given security, along with news for context. While its visual design has probably been topped by other dashboards, what makes it a great example is that it’s publicly available, which is uncommon for business data.

While this chart has gridlines and labelled axes, that is not enough to look up precise values. However, moving the mouse over the chart allows the user to read a precise value at a given point in time. A blue point appears and the precise value can be read in the top left corner.

One very common data filter in charts is a control that affects the time range: the date picker. By selecting a different time range, we make the chart represent a different slice of the dataset – we effectively filter the dataset so that only the relevant dates are shown. This is in contrast with traditional printed charts, again, where all of the dataset is shown at once. For instance, we can click on “6m” and we’ll be treated to data from the last 6 months:

Comparing the selected security with others will make the chart show the data in a different mode. This is the same data (plus added series), in the same screen and the same context, but the chart is visually very different:

As to the other two characteristics of web charts I mentioned, data exports and drill downs, they are also featured (but less graphical to show, so I haven’t captured a screenshot for those). There is a link on a left-side column to get the equivalent data (so it is always possible to go beyond what is shown on screen). The little flags with letters in the 3 first screenshots are clickable, and represent relevant news. Clicking them will highlight that article in a right-side column. So it is always possible to get more information.

What does that change?

Everything.

Rules or best practices based on the assumption that data is hard to look up or to compare are less important. The chart itself has to be legible though. So, for instance, it’s ok to have pie charts or donut charts, as long as the number of categories doesn’t go totally overboard.

Web charts, and dashboards even more so, should focus on only showing relevant data first, then showing it in the most useful and legible way. Again, a noted difference with the print philosophy where as much data as possible should be shown.

How this plays out is what we’ll cover in the next articles of the series: Dashboards versus data visualizations.

December 29, 2012 by jerome on charts, d3, data visualization

Gun murders in America

Click for interactive visualization

I have created this map of every homicide in the USA using firearms for the latest year where detailed information was available. Every, that is from all the agencies that report homicides to the FBI, which is not an obligation – this is why the map lacks Florida data.
In the interactive version you can see how murders happen through the year and explore them according to several criteria that were available in the database. While large shooting sprees receive media attention, unfortunately there are thousands of cases each year in just about every community.
Technically this is my first foray with d3.v3 and it uses two of its major new features, topoJSON for easy, lightweight maps, and hex binning to represent many individual events in one hexagon. Thanks to Mike Bostock for tutoring on that.

March 14, 2011 by jerome on charts, data visualization

Stuff I do with Tableau

I put together a list of some of the things I’ve done with Tableau public there. I had put the link on twitter last Friday, and I just saw the number of connections through bit.ly – never thought something posted just before the week-end could get so much attention! so thanks, twitter followers. I’m putting another link here on the blog for convenience.

I’ve been using Tableau public ever since its release (well actually a tad before). Over time I’ve been trying to use it less as a visually pleasing and convenient way to shove a lot of data in a limited space (like here), and more as a way to promote a certain angle when looking at a dataset (like here), that is, as an invitation to the viewers to reach the same conclusions as us, but by giving them access to the data so they can see for themselves.

One of my first vizzes - just a layer on top of the dataset

This one is less neutral. I would like readers to embrace the opinion in the associated article, so I use Tableau to present the data in a way that supports this opinion.

That long list is still a subset of what I’ve done with Tableau, mostly because like many people in datavis I use Tableau at two stages.

I use it to communicate a finished visualization such as these, although I may use a static image or another interactive tool.

But I also almost systematically use it in the early stages, when I receive a dataset and I need to make sense of it before I can represent it. By manipulating a dataset in Tableau and testing various basic dimension combinations one can quickly see the points of interest in the data and come up with relevant questions to ask the data, to which a visualization is the answer. So while I can’t share these “drafts” they are very very helpful.

Also over time I got better (hopefully) at controlling dashboards so they look exactly the way I want them to and not how I manage to put them together. What helps is setting the size of the dashboard as exact dimensions (i.e. 600 by 400), not as a range, and, unsurprisingly, drawing the dashboard on paper first. Anyway, all the dashboards which are on that page are freely downloadable if you want to see how they are done.

February 22, 2011 by jerome on charts, protovis

The top TV earners are not found on tabloids.

This is my contribution to Flowing Data challenge: visualize this / top tv earners.

When I looked at the dataset provided by Nathan I first wondered what was missing. Quite a few stars were missing in the list. I added Simon Baker whose Mentalist gets a huge audience and who is said to be getting $450k per show. Others who probably should be there include Forest Whitaker for the Criminal Minds spinoff, Jim Belushi in the Defenders, Kate Walsh in Private Practice or Elizabeth Mitchell in V. I don't think any of them is settling for less than $15k per episode.

Another thing was the audiences. There are a few good sources for that (and some less good) so I tried to approximate how many viewers a show would generate for its channel. I came up with hourly viewers because a show that brings 10 million people to one channel for half an hour should be equivalent to one that gets 5 million viewers for an hour, right?
In fact, I wanted to come up with a proxy of how much value a TV show was generating and how much of that went into the pockets of the cast. It's probably possible to come up with a precise answer with better data than could be assembled by an amateur over a lunch break, audience is probably a part of the equation, but the short story doesn't require much calculation: actors only get the crumbs of a very fat pie.
A show like Grey's Anatomy got $329.1m of ad revenue in 2009, which I assume is a US-only figure, the show being syndicated in many countries, unfortunately including France. And that excludes sales of DVDs, paid downloads and other streams of revenue. Out of that, Patrick Dempsey only got $6m. Now $6m is a lot of money, but actors of successful, established shows don't get a very good deal here. In the last seasons of "Friends", the 6 main actors all got $1m per episode, which seemed fair in retrospect. Sarah Jessica Parker's salary even reached $3.2m per Sex and the City episode, but she was co-producer. And this was then.

The chart can be divided in quadrants. On the lower-right corner, Charlie Sheen, who is the only one there with an old-school deal.
On the higher-left corner, the work horses - stars of the crime shows with stellar ratings, who could ask for more.
On the very bottom-left, those who are happy to be there. Etc.

Now some stars of shows that get well over 10m viewers get "only" around $100k per episode. So obviously the revenue stream must go somewhere else! My money is on the writers 🙂

Also for fun, I computed the ratio of their salary to the length of each episode. I once calculated that I earned about 83 cents a minute, which sounds pretty ridiculous compared to Charlie Sheen's $62500!

February 11, 2011 by jerome on charts, data visualization, protovis, tips

Working with data in protovis: part 5 of 5

previous: reshaping complex arrays (4/5)

Working with layouts

In this final part, we’re going to look at how we can shape our data to use the protovis built-in layouts such as stacked areas, treemaps or force-directed graphs.
This is not a tutorial on how to use layouts stricto sensu, and I advise anyone interested to first look at the protovis documentation to see what can be done with this and to understand the underlying concepts.

But if there is one thing to know about layouts, it’s that they allow you to create non-trivial visualizations in even less code than regular protovis, provided that you pass them data in a form they can use, and this is precisely where we come in.

Three great categories of layouts

Currently, there are no fewer than 13 types of layouts in Protovis. Fortunately, there are examples for all of them in the gallery.
There are layouts for:

Arrays of data

Grids or heatmaps doc example
Stacked areas or streamgraphs doc stacked areas streamgraphs

Networks (nodes and links)

Arc diagrams doc example
Force-directed graphs doc example
Relationship matrices doc example
Rollup network doc (no example in Protovis 3.2, but there is one in 3.3)

Trees and hierarchized data

Dendrograms doc example
Indented trees doc example
Packed circles doc example
Sunbursts and icicles doc sunbursts icicles
Node-link trees doc example
Treemaps doc example

In addition, there are layouts like pv.Layout.Bullet which require data to have a certain specific shape but the example from the gallery is very explicit. (et tu, Horizon layout).

Arrays of data

In order to work with this kind of layout, the simplest thing is to put your data in a 2-dimensional array:

var data=[
   [8,3,7,2,5],
   [9,6,1,7,4],
    ...
   [7,4,3,6,8]
];

For the grid layout, this gives you an array of cells divided in columns (number of elements in each line) and rows (number of lines).
The idea of the grid layout is that your cells are automatically positioned and sized, so afaik the only thing you can do is add a mark such as a pv.Bar which would fill them completely, but which you could still style with fillStyle or strokeStyle. You can’t really access the underlying data with functions but you can use methods that rely on default values, like adding labels.

For instance, you can use it to generate a QR code:

var qr=[
"000000000000000000000000000",
"011111110001010100011111110",
"010000010101001110010000010",
"010111010000010100010111010",
"010111010111011110010111010",
"010111010010000001010111010",
"010000010110110010010000010",
"011111110101010101011111110",
"000000000011100100000000000",
"011111011110101110101010100",
"000010101001010111101000100",
"010101111001001011111010110",
"001011000100010101010100010",
"001100010111011010010101110",
"010101100110001101001010100",
"010011010011111111100110110",
"010111101010100101000010010",
"010100110010111101111101000",
"000000000101010111000111000",
"011111110100011001010111110",
"010000010000110011000110110",
"010111010110001011111111000",
"010111010101101100110101110",
"010111010100000111001001010",
"010000010111010101101110010",
"011111110101001100011111110",
"000000000000000000000000000",
].map(function(i) i.split(""));

var vis = new pv.Panel()
    .width(216)
    .height(216);
vis.add(pv.Layout.Grid)
    .rows(qr)
 	.cell.add(pv.Bar)
 	    .fillStyle(pv.colors("#fff", "#000"))
     ;
vis.render();

(BTW, this is the QR code to this page)

On line 29, I’m using a map function to turn this array of strings, which is easier and shorter to type, into a bona fide 2-dimensional array.
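The shorthand `function(i) i.split("")` is understood by Protovis scripts but is not standard JavaScript. As a minimal sketch, here is the same conversion written in plain JavaScript. The three-row `rows` array is a made-up sample standing in for the 27 rows above, and the numeric conversion is an optional extra.

```javascript
// Hypothetical sample: three short strings instead of the full QR code.
const rows = ["010", "111", "001"];

// Split each string into single characters, giving an array of arrays.
const grid = rows.map(function (row) {
  return row.split("");
});

// Optionally convert the "0"/"1" characters to numbers.
const numericGrid = grid.map(function (row) {
  return row.map(Number);
});

console.log(grid[1]);        // [ '1', '1', '1' ]
console.log(numericGrid[2]); // [ 0, 0, 1 ]
```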

That’s all there is to grids. Of all the layouts, they are among the easiest to reproduce with regular protovis.

Now, stacks.
The easiest way to use them is to pass them 2-dimensional arrays. Now it doesn’t have to be arrays of numbers, it can be arrays of associative arrays in case you need to do something exotic. But for the following examples let’s just assume you don’t. Here is how you’d do a stacked area, stacked columns and stacked bars respectively:

var data=[
[1000,1200,1500,1700],
[100,500,300,200]
];
var vis=new pv.Panel().width(200).height(200);
vis.add(pv.Layout.Stack)
    .layers(data)
    .x(function() 50*this.index)
    .y(function(d) d/20)
    .layer.add(pv.Area)

All you need is to feed the layers, x and y properties of your stack, then say what you want to add to your layers.
Now, columns:

vis.add(pv.Layout.Stack)
    .layers(data)
    .x(function() 50*this.index)
    .y(function(d) d/20)
    .layer.add(pv.Bar).width(40)

and finally, bars:

vis.add(pv.Layout.Stack)
    .layers(data)
    .orient("left")
    .x(function() 50*this.index)
    .y(function(d) d/20)
    .layer.add(pv.Bar).height(40)

For bars, there is a little trick here. I specify that the layer orientation is horizontal (“left”) and I change the height instead of the width of the added pv.Bar.
And that’s all there is. You can create various streamgraphs by playing with the order and offset properties of the stack but this doesn’t change anything about the data structure, so we’re done here.
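To see why a 2-dimensional array is all a stack needs, here is a rough sketch in plain JavaScript of the bookkeeping involved: for each layer, where does each value start? This assumes a zero baseline and is not Protovis code. The function name `stackOffsets` and the `y0`/`y` property names are mine.

```javascript
// Same shape as the stack example: one inner array per layer.
const layers = [
  [1000, 1200, 1500, 1700],
  [100, 500, 300, 200]
];

// For each layer, compute where each value starts (y0) by summing
// the values of the layers beneath it at the same position.
function stackOffsets(layers) {
  const running = layers[0].map(function () { return 0; });
  return layers.map(function (layer) {
    return layer.map(function (value, i) {
      const y0 = running[i];
      running[i] += value;
      return { y0: y0, y: value };
    });
  });
}

const stacked = stackOffsets(layers);
console.log(stacked[1][0]); // { y0: 1000, y: 100 }
```

Changing the order or the baseline only changes how `y0` is computed, while the input array keeps the same shape.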

Representing networks

Protovis provides 3 cool layouts to easily exhibit relationships between nodes: arc diagrams, matrix diagrams and force-directed layouts.
The good news is that the shape of the data required by those three layouts is identical.

They require an array that corresponds to the nodes. This can be as simple as a pv.range(), or as sophisticated as an array of associative arrays if you want to style your network graph according to several potential attributes of the node.

And they also require an array for the links. This array has a more rigid form, it must be an array of associative arrays of the shape: {source: #, target: #, value: #} where the values for source and target correspond to the position of a node in the node array, and value indicates the strength of the link.

So let’s do a simple one.

var nodes=pv.range(6); // why more complex, right?
var links=[
{source:0, target:1, value:2},
{source:1, target:2, value:1},
{source:1, target:3, value:1},
{source:2, target:4, value:4},
{source:3, target:5, value:1},
{source:4, target:5, value:1},
{source:1, target:5, value:3}
]
var vis = new pv.Panel()
    .width(200)
    .height(200)
    ;
var arc = vis.add(pv.Layout.Arc)
    .nodes(nodes)
    .links(links)
	.bottom(100)
arc.link.add(pv.Line);
arc.node.add(pv.Dot)
    .size(50)
vis.render();

Here, by varying the strength of the link, the thickness of the arcs changes accordingly. The nodes are left unstyled. Had we passed a more complicated dataset to the nodes array, we could have changed their properties (fillStyle, size, strokeStyle, labels etc.) with appropriate accessor functions.

With a few modifications we can create a force-directed layout and a matrix diagram.

var force = vis.add(pv.Layout.Force)
    .nodes(nodes)
    .links(links);

force.link.add(pv.Line);

force.node.add(pv.Dot)
	.size(50)
	.anchor("center").add(pv.Label)
		.text(function() this.index);

vis.render();

Here I labelled the nodes so one can tell which is which. This is done by adding a pv.Label to the pv.Dot that’s attached to the node, just like with any other mark.

var Matrix = vis.add(pv.Layout.Matrix)
	.nodes(nodes)
	.directed(true)
	.links(links)
	.top(20).left(20)

Matrix.link.add(pv.Bar)
    .fillStyle(function(d) pv.Scale.linear(0, 2, 4)
      .range('#eee', 'yellow', 'green')(d.linkValue))

Matrix.label.add(pv.Label).text(function() Math.floor(this.index/2))

vis.render();

For the matrix, things are slightly more complex than for the previous two. Here I opted for a directed matrix, as opposed to a bidirectional one: this means that each link is shown once, from its source to its target, and not twice (i.e. also from its target back to its source), which is the default.
I chose to color the bar attached to my links (which are cells of the matrix) according to the strength of my links. Again, if my nodes field was more qualified, I could have used these properties.

Finally, we’ve added labels to the custom property Matrix.label. Only, the labels are numbered from 0 to 11 so to get numbers from 0 to 5 for both rows and columns I used Math.floor(this.index/2) (integer part of half of this number).
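To make the directed versus bidirectional distinction concrete, here is a small sketch in plain JavaScript that turns the same links array into a matrix of values. It is not how Protovis does it internally. The function `toMatrix` and the choice of rows for sources and columns for targets are my own assumptions.

```javascript
const nodeCount = 6; // as many nodes as pv.range(6)
const links = [
  { source: 0, target: 1, value: 2 },
  { source: 1, target: 2, value: 1 },
  { source: 1, target: 3, value: 1 },
  { source: 2, target: 4, value: 4 },
  { source: 3, target: 5, value: 1 },
  { source: 4, target: 5, value: 1 },
  { source: 1, target: 5, value: 3 }
];

// True when i is a valid position in an array of nodeCount nodes.
function isValidIndex(i, nodeCount) {
  return Number.isInteger(i) && i >= 0 && i < nodeCount;
}

// Build a nodeCount x nodeCount matrix: rows are sources, columns are targets.
function toMatrix(nodeCount, links, directed) {
  const matrix = [];
  for (let i = 0; i < nodeCount; i++) {
    matrix.push(new Array(nodeCount).fill(0));
  }
  links.forEach(function (link) {
    if (!isValidIndex(link.source, nodeCount) || !isValidIndex(link.target, nodeCount)) {
      throw new Error("link points to a node that does not exist");
    }
    matrix[link.source][link.target] += link.value;
    if (!directed) {
      // a bidirectional link is also counted from target back to source
      matrix[link.target][link.source] += link.value;
    }
  });
  return matrix;
}

const directedMatrix = toMatrix(nodeCount, links, true);
const symmetricMatrix = toMatrix(nodeCount, links, false);
console.log(directedMatrix[1][5], directedMatrix[5][1]);   // 3 0
console.log(symmetricMatrix[1][5], symmetricMatrix[5][1]); // 3 3
```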

Hierarchized data

As with networks, the shape of the data we can feed to treemaps, icicles and other hierarchical representations doesn’t change. So once you have your data in order, you can easily switch representations.

Essentially, you will be passing a tree of the form:

var myTree={
   rootnode: {
      node: {
      ...
         node: {
            leaf: value,
            leaf: value,
            ...
            leaf: value
         },
      ...
      }
   }
};

The protovis examples use the hierarchy of the flare source code as an example, which really shows what can be done with a treemap and other tree representations.

For our purpose we are going for a simpler tree, inspired by the work of Periscopic on congressspeaks.com which Kim Rees showed at Strata.
Kim’s presentation featured tiny treemaps that showed the voting record for a congressperson, and whether they had voted for or against their party.

So let’s play with the voting record of a hypothetical congressperson:

var hasVoted={
	didnt: 100,
	voted: {
	    yes: {
	        yesWithParty: 241,
	        yesAgainstParty: 23
	    },
	    no: {
	        noWithParty: 73,
	        noAgainstParty: 5
	    }
	}
};

Once you have your tree, you will need to pass it to your layout using pv.dom, like this:

pv.dom(hasVoted).root("hasVoted").nodes()
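To get a feel for what the layouts receive, here is a rough sketch in plain JavaScript that walks the same object and lists every node with a name and, for leaves, a value. It only approximates the idea of a flat list of nodes and is not the Protovis implementation. The function `flatten` and the `depth` property are my own additions.

```javascript
const hasVoted = {
  didnt: 100,
  voted: {
    yes: { yesWithParty: 241, yesAgainstParty: 23 },
    no: { noWithParty: 73, noAgainstParty: 5 }
  }
};

// Visit a node, then its children, collecting one record per node.
function flatten(name, value, depth, out) {
  const isLeaf = typeof value !== "object" || value === null;
  out.push({
    nodeName: name,
    nodeValue: isLeaf ? value : undefined, // only leaves carry a value
    depth: depth
  });
  if (!isLeaf) {
    Object.keys(value).forEach(function (key) {
      flatten(key, value[key], depth + 1, out);
    });
  }
  return out;
}

const nodes = flatten("hasVoted", hasVoted, 0, []);
console.log(nodes.map(function (n) { return n.nodeName; }).join(", "));
// hasVoted, didnt, voted, yes, yesWithParty, yesAgainstParty, no, noWithParty, noAgainstParty
```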

Based on that let’s do two hierarchical representations.
Let’s start with a tree:

var vis = new pv.Panel()
    .width(500)
    .height(200)
    ;
var tree = vis.add(pv.Layout.Tree)
    .nodes(pv.dom(hasVoted).root("hasVoted").nodes())
    .depth(40)
    .breadth(100)
    .top(30)
    .right(100)
    ;
tree.link.add(pv.Line);
tree.node.add(pv.Dot)
    .size(function(n) n.nodeValue)
	.anchor("center").add(pv.Label).textAlign("center").text(function(n) n.nodeName)
vis.render();

And here is the result:

There are many styling possibilities obviously left unexplored in this simple example (you can control properties of the tree.link, tree.node, tree.labels which we didn’t use here, etc.), but this won’t change much as far as data are concerned.

Now let’s try a treemap with the same dataset.

var vis = new pv.Panel()
    .width(400)
    .height(200)
    ;

var tree = vis.add(pv.Layout.Treemap)
	.width(200).height(200)
    .nodes(pv.dom(hasVoted).root("hasVoted").nodes())
    ;

tree.leaf.add(pv.Panel)
	.fillStyle(function(d) d.nodeName=="didnt"?"darkgrey":d.nodeName.slice(0,3)=="yes"?
	d.nodeName.slice(-9)=="WithParty"?"powderblue":"steelblue":
	d.nodeName.slice(-9)=="WithParty"?"lightsalmon":"salmon")

vis.add(pv.Panel)
	.data([
		   {label:"yes with party", 	color: "powderblue"},
		   {label:"yes against party", 	color: "steelblue"},
		   {label:"no with party", 		color: "lightsalmon"},
		   {label:"no against party", 	color: "salmon"},
		   {label:"didn't vote", 		color: "darkgrey"}
		   ])
	.left(220)
	.top(function() 50+20*this.index)
	.height(15)
	.width(20)
	.fillStyle(function(d) d.color)
	.anchor("right").add(pv.Label).textAlign("left").text(function(d) d.label)

vis.render();

and what took the longest part of the code was making the legend.

Here is the outcome:

February 11, 2011 by jerome on charts, data visualization, protovis, tips

Working with data in protovis – part 4 of 5

Previous: array functions in javascript and protovis

Reshaping complex arrays

This really is what protovis data processing is all about.
In today’s tutorial, I am going to refer extensively to protovis’s Becker’s Barley example. One reason for that is that it’s also used in the API documentation of the methods we are going to cover, and also because I am posting a line-by-line explanation of this example that you can refer to.

So far we’ve seen that :

Associative arrays are great as data elements, as their various values can be used for various attributes.
For instance, if the current data element is an associative array of this shape:
```
{ yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" }
```
one could imagine a bar chart where the length of the bar would come from the yield, their fillStyle color from the variety, the label from the site, etc.

arrays of associative arrays are very practical to manipulate thanks to accessor functions.
An array of the shape:

var barley = [
  { yield: 27.00000, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 27.43334, variety: "Manchuria", year: 1931, site: "Morris" },
  { yield: 39.93333, variety: "Manchuria", year: 1931, site: "Crookston" },
  { yield: 32.96667, variety: "Manchuria", year: 1931, site: "Grand Rapids" },
  { yield: 28.96667, variety: "Manchuria", year: 1931, site: "Duluth" },
  { yield: 43.06666, variety: "Glabron", year: 1931, site: "University Farm" },
  { yield: 55.20000, variety: "Glabron", year: 1931, site: "Waseca" },
  { yield: 28.76667, variety: "Glabron", year: 1931, site: "Morris" },
  { yield: 38.13333, variety: "Glabron", year: 1931, site: "Crookston" },
  { yield: 29.13333, variety: "Glabron", year: 1931, site: "Grand Rapids" },
  { yield: 29.66667, variety: "Glabron", year: 1931, site: "Duluth" },
  { yield: 35.13333, variety: "Svansota", year: 1931, site: "University Farm" },
…
  { yield: 29.33333, variety: "Wisconsin No. 38", year: 1932, site: "Duluth" }
]

could be easily sorted according to any of the keys – yield, variety, year, site, etc.

it is easy to access the data of an element’s parent, and in some cases it can greatly simplify the code.

Nesting

For this last reason, you may want to turn one flat array of associative arrays into an array of arrays of associative arrays. This process is called nesting.

Simple nesting

If you turn a single array like the one on the left-hand side to an array of arrays like on the right-hand side, you could easily do 3 smaller charts, one next to the other, by creating them inside of panels. You could have some information at this panel level (for instance the variety) and the rest at a lower level.

Fortunately, there are protovis methods that can turn your flat list into a more complex array of arrays. And since protovis methods are meant to be chained, you can even go on and create arrays of arrays of arrays of arrays if need be.
Even better – combined with the other data processing functions, you don’t only change the structure of your array, but you can also filter and control the order of the elements to show everything that you want and only what you want.

And how complicated can this be?
To do the above, all you have to type is

barley=pv.nest(barley).key(function(d) d.variety).entries();

What this does is that it nests your barley array, according to the variety key. entries() at the end is required to obtain the array of arrays needed.
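If you are curious about what such a nesting amounts to, here is a minimal sketch in plain JavaScript that produces the same kind of `{key, values}` entries. It is my own approximation, not the Protovis source. The function name `nestBy` is hypothetical, and the sample is a four-row excerpt of the barley data.

```javascript
// Group a flat array into [{ key, values }, ...], in order of first appearance.
function nestBy(array, keyFn) {
  const index = {};   // key -> entry, for quick lookup
  const entries = [];
  array.forEach(function (d) {
    const key = String(keyFn(d));
    if (!Object.prototype.hasOwnProperty.call(index, key)) {
      index[key] = { key: key, values: [] };
      entries.push(index[key]);
    }
    index[key].values.push(d);
  });
  return entries;
}

// Four rows taken from the barley array above.
const sample = [
  { yield: 27.0, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 43.06666, variety: "Glabron", year: 1931, site: "University Farm" },
  { yield: 55.2, variety: "Glabron", year: 1931, site: "Waseca" }
];

const nested = nestBy(sample, function (d) { return d.variety; });
console.log(nested.map(function (e) { return e.key + ": " + e.values.length; }));
// [ 'Manchuria: 2', 'Glabron: 2' ]
```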

Here is an example of what can be done with both kinds of data structures in just a few lines of code (which won’t include the data proper. The long, flat array is stored in the variable barley, as above).
Without nesting:

var vis = new pv.Panel()
    .width(600)
    .height(200)
;
vis.add(pv.Bar)
	.data(barley)
	.left(function() this.index*5)
	.width(4)
	.bottom(0)
	.height(function(d) d.yield*2)
	.fillStyle(function(d) d.year==1931?"steelBlue":"Orange")
;
vis.render();

As the pv.Bar goes through the array, there is not much it can do to structure it. We can just size the bars according to the value of yield, and color them according to another key (here the year).

Now using nesting:

barley2=pv.nest(barley).key(function(d) d.variety).entries();
var vis = new pv.Panel()
    .width(710)
    .height(150)
;
var cell=vis.add(pv.Panel)
	.data(barley2)
	.left(function() this.index*70)
	.width(70)
	.top(0)
	.height(150);
cell.anchor("top").add(pv.Label).textAlign("center").font("9px sans-serif").text(function(d) d.key)
cell.add(pv.Bar)
		.data(function(d) d.values)
		.left(function() 5+this.index%6*10)
		.width(8)
		.bottom(0)
		.height(function(d) d.yield*2)
		.fillStyle(function(d) d.year==1931?"steelBlue":"Orange")
		.add(pv.Label).text(function(d) d.site).textAngle(-Math.PI/2)
			.textAlign("left").left(function() 15+this.index%6*10).textStyle("#222")
	;
vis.render();

Here we used the same simple nesting command as above (the original example uses a more complicated one which allows for more refinement in the display). This structure allows us to first create panels, which we can style by displaying the name of the variety for instance, then, within these panels, the corresponding bars.

Doing this with the data in its original form would have been possible, but would have required writing a much longer program. So the whole idea of nesting is to take some time to plan the data structure once, so that the code is as short and useful as possible.

Going further with nesting

However, it is possible to go beyond that:

by nesting further, which can be done by adding other .key() methods:

barley=pv.nest(barley)
  .key(function(d) d.variety)
  .key(function(d) d.year)
  .entries();

And/or

by sorting keys or values using the sortKeys() and sortValues() methods, respectively.

For instance, we can change the order in which the variety blocks are displayed with sortKeys():
barley=pv.nest(barley) .key(function(d) d.variety) .entries();
barley=pv.nest(barley) .key(function(d) d.variety) .sortKeys() .entries();

By using sortKeys without argument, the natural order is used (alphabetical, since our key is a string). But we could provide a comparison function if we wanted a more sophisticated arrangement.
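As an illustration of what a comparison function looks like, here is a sketch in plain JavaScript that sorts entries by key, first in natural order, then with a custom rule. The `entries` array is a made-up, emptied-out stand-in for nested barley data, and `sortByKey`, `naturalOrder` and `reverseOrder` are hypothetical helper names.

```javascript
// Hypothetical entries, shaped like the output of a nest by variety.
const entries = [
  { key: "Manchuria", values: [] },
  { key: "Glabron", values: [] },
  { key: "Svansota", values: [] }
];

// Natural order: alphabetical for strings.
function naturalOrder(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// A custom comparison: reverse alphabetical order.
function reverseOrder(a, b) {
  return naturalOrder(b, a);
}

function sortByKey(entries, compare) {
  // slice() copies the array so the original order is left untouched
  return entries.slice().sort(function (e1, e2) {
    return compare(e1.key, e2.key);
  });
}

console.log(sortByKey(entries, naturalOrder).map(function (e) { return e.key; }));
// [ 'Glabron', 'Manchuria', 'Svansota' ]
console.log(sortByKey(entries, reverseOrder).map(function (e) { return e.key; }));
// [ 'Svansota', 'Manchuria', 'Glabron' ]
```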

Nesting and hierarchy

If you run the double nesting command we discussed above,

barley=pv.nest(barley)
  .key(function(d) d.variety)
  .key(function(d) d.year)
  .entries();

you’ll get as a result something of the form:

var barley=[
  {key:"Manchuria", values: [
    {key:"1931", values: [
      {site:"University Farm", variety: "Manchuria", year: 1931, yield: 27},
      {site:"Waseca", variety: "Manchuria", year: 1931, yield: 48.86667},
      {site:"Morris", variety: "Manchuria", year: 1931, yield: 27.43334},
      {site:"Crookston", variety: "Manchuria", year: 1931, yield: 39.93333},
      {site:"Grand Rapids", variety: "Manchuria", year: 1931, yield: 32.96667},
      {site:"Duluth", variety: "Manchuria", year: 1931, yield: 28.96667}
    ]},
    {key:"1932", values: [  
      {site:"University Farm", variety: "Manchuria", year: 1932, yield: 26.9},
      {site:"Waseca", variety: "Manchuria", year: 1932, yield: 33.46667},
      {site:"Morris", variety: "Manchuria", year: 1932, yield: 34.36666},
      {site:"Crookston", variety: "Manchuria", year: 1932, yield: 32.96667},
      {site:"Grand Rapids", variety: "Manchuria", year: 1932, yield: 22.13333},
      {site:"Duluth", variety: "Manchuria", year: 1932, yield: 22.56667}
    ]}
  ]},
  {key: "Glabron", ...
]

and so on and so forth for all the varieties of barley. Now how can we use this structure in a protovis script? Why not use multi-dimensional arrays instead, and if so, how would the code change?

Well. You’d start using this structure by creating a first panel and passing it the nested structure as data.
Assuming your root panel is called vis, you’d start like this:

vis.add(pv.Panel)
    .data(barley)

Now, since barley has been nested first by variety, it is an array of 10 elements. You are going to create 10 individual panels. At some point you should worry about dimensioning and positioning them. But here we are only focusing on passing data to subsequent elements.

Next, you are going to create another set of panels (or any mark, really, this doesn’t change anything for the data)

.add(pv.Panel)
    .data(function(d) d.values)

This is how you drill down to the next level of data, by using an accessor function with the key “values”.
Congratulations! You have created 2 panels in each of our 10 individual panels, one per year.

Finally, you are going to create a final mark (let’s say, a pv.Bar)

.add(pv.Bar)
    .data(function(d) d.values)

Again, you use an accessor function of the same form. This will create a bar chart with 6 bars.
The data element corresponding to each bar is of the form:

{site:"University Farm", variety: "Manchuria", year: 1932, yield: 26.9}

So, when you style the chart, you can access these properties with accessor functions, and write for instance:

    .height(function(d) d.yield)
    .add(pv.Label).text(function(d) d.variety)

etc.

To sum it up: you can create a hierarchical structure in protovis that corresponds to the shape of your nested array by adding elements and passing data using an accessor function with the key “values”.
At the lowest level of your structure you can access all the properties of the original array using accessor functions.
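The same drill-down can be tried without any chart at all. Below is a sketch in plain JavaScript that nests by several keys and then walks the result level by level through `values`, which is what the panels and bars do. `nestByKeys` is a hypothetical helper of mine, not a Protovis function, and the sample is a four-row excerpt of the barley data.

```javascript
// Nest by several key functions in turn; each level yields { key, values }.
function nestByKeys(array, keyFns) {
  if (keyFns.length === 0) return array; // lowest level: the original objects
  const groups = new Map();
  array.forEach(function (d) {
    const key = String(keyFns[0](d));
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(d);
  });
  return Array.from(groups, function (pair) {
    return { key: pair[0], values: nestByKeys(pair[1], keyFns.slice(1)) };
  });
}

const sample = [
  { site: "University Farm", variety: "Manchuria", year: 1931, yield: 27 },
  { site: "Waseca", variety: "Manchuria", year: 1931, yield: 48.86667 },
  { site: "University Farm", variety: "Manchuria", year: 1932, yield: 26.9 },
  { site: "University Farm", variety: "Glabron", year: 1931, yield: 43.06666 }
];

const nested = nestByKeys(sample, [
  function (d) { return d.variety; },
  function (d) { return d.year; }
]);

// One loop per level, each reading the "values" of the level above.
nested.forEach(function (variety) {
  variety.values.forEach(function (year) {
    year.values.forEach(function (d) {
      console.log(variety.key, year.key, d.site, d.yield);
    });
  });
});
```

Note that the keys come out as strings ("1931"), as in the nested structure shown earlier.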

Now, what if instead we used a normal multi-dimensional array without keys and values? Don’t they have structure and hierarchy, too?
This is not only possible, but also advised when your dataset is getting really big, as you would otherwise plague your users with annoying loading times. This changes the structure of the code, though.

An equivalent multi-dimensional array would be something like:

var yields =
[ // this is the level of the variety
  [ // this is the level of the year
    [ 27, 48.86667, 27.43334, 39.93333, 32.96667, 28.96667], 
    [ 26.9, 33.46667, 34.36666, 32.96667, 22.13333, 22.56667]
  ],
  [ 
    [ 43.06666, 55.2, 28.76667, 38.13333, 29.13333, 29.66667],
    [ 36.8, 37.73333, 35.13333, 26.16667, 14.43333, 25.86667]
  ],
  [
    [ 35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.7],
    [ 27.43334, 38.5, 35.03333, 20.63333, 16.63333, 22.23333]
  ],
  [
    [ 39.9, 50.23333, 26.13333, 41.33333, 23.03333, 26.3],
    [ 26.8, 37.4, 38.83333, 32.06666, 32.23333, 22.46667]
  ],
  [
    [ 36.56666, 63.8333, 43.76667, 46.93333, 29.76667, 33.93333],
    [ 29.06667, 49.2333, 46.63333, 41.83333, 20.63333, 30.6]
  ],
  [
    [ 43.26667, 58.1, 28.7, 45.66667, 32.16667, 33.6],
    [ 26.43334, 42.2, 43.53334, 34.33333, 19.46667, 22.7]
  ],
  [
    [ 36.6, 65.7667, 30.36667, 48.56666, 24.93334, 28.1],
    [ 25.56667, 44.7, 47, 30.53333, 19.9, 22.5]
  ],
  [
    [ 32.76667, 48.56666, 29.86667, 41.6, 34.7, 32],
    [ 28.06667, 36.03333, 43.2, 25.23333, 26.76667, 31.36667]
  ],
  [
    [ 24.66667, 46.76667, 22.6, 44.1, 19.7, 33.06666],
    [ 30, 41.26667, 44.23333, 32.13333, 15.23333, 27.36667]
  ],
  [
    [ 39.3, 58.8, 29.46667, 49.86667, 34.46667, 31.6],
    [ 38, 58.16667, 47.16667, 35.9, 20.66667, 29.33333]
  ]
]

And that’s the whole lot. It is indeed shorter. Now, this array holds only the yields, so you may want to create arrays of the possible values of varieties, sites and years for good measure.

var varieties=["Manchuria", "Glabron", "Svansota", "Velvet", "Trebi",
     "No. 457", "No. 462", "Peatland", "No. 475", "Wisconsin No. 38"], 
    sites=["University Farm", "Waseca", "Morris", "Crookston", "Grand Rapids", "Duluth"],
    years=[1931,1932];

And by the way, it is very possible to create these arrays out of the original array using the map() method or equivalent.
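Here is one way this could look in plain JavaScript, as a sketch rather than a recipe. It works on an eight-row excerpt of the barley data, and the helper name `distinct` is my own. It derives the lists of varieties, years and sites, then fills a 3-dimensional array of yields in that order.

```javascript
// Eight rows taken from the barley data (2 varieties x 2 years x 2 sites).
const barley = [
  { yield: 27, variety: "Manchuria", year: 1931, site: "University Farm" },
  { yield: 48.86667, variety: "Manchuria", year: 1931, site: "Waseca" },
  { yield: 26.9, variety: "Manchuria", year: 1932, site: "University Farm" },
  { yield: 33.46667, variety: "Manchuria", year: 1932, site: "Waseca" },
  { yield: 43.06666, variety: "Glabron", year: 1931, site: "University Farm" },
  { yield: 55.2, variety: "Glabron", year: 1931, site: "Waseca" },
  { yield: 36.8, variety: "Glabron", year: 1932, site: "University Farm" },
  { yield: 37.73333, variety: "Glabron", year: 1932, site: "Waseca" }
];

// Distinct values of one property, in order of first appearance.
function distinct(array, accessor) {
  return array.map(accessor).filter(function (value, i, all) {
    return all.indexOf(value) === i;
  });
}

const varieties = distinct(barley, function (d) { return d.variety; });
const years = distinct(barley, function (d) { return d.year; });
const sites = distinct(barley, function (d) { return d.site; });

// yields[variety][year][site], filled from the flat array.
const yields = varieties.map(function (variety) {
  return years.map(function (year) {
    return sites.map(function (site) {
      const match = barley.find(function (d) {
        return d.variety === variety && d.year === year && d.site === site;
      });
      return match ? match.yield : null; // null marks a missing combination
    });
  });
});

console.log(varieties);  // [ 'Manchuria', 'Glabron' ]
console.log(yields[1]);  // [ [ 43.06666, 55.2 ], [ 36.8, 37.73333 ] ]
```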

How can we create an equivalent structure?
We start as above:

vis.add(pv.Panel)
     .data(yields)

Likewise, our 3-dimensional array is really an array of 10 arrays of 2 arrays of 6 elements. So we are also creating 10 panels. Let’s continue and create panels for the years:

.add(pv.Panel)
    .data(function(d) d)

To drill down one level in an array, you have to use this form. You say that you are giving the children of your object what’s inside the data property of their parent.

So naturally, you follow by

.add(pv.Bar)
    .data(function(d) d)

Now, how you style your bars will be slightly different from before. What you passed your first panel was an array of yields. So that’s what you get now from your data. If you want something else, you’ll have to get it with this.index for instance.

    .height(function(d) d) // that's the yield
    .add(pv.Label).text(function() sites[this.index]) // each bar is one site

All in all it’s trickier to work with arrays. The code is less explicit, and if you change one array even by accident, you’ll have to check that others are still synchronized. But it could make your vis much faster.
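To make the synchronization issue concrete, here is a minimal plain JavaScript sketch that rebuilds flat records from the nested array by looking up each label by position. It assumes the nesting order is variety, then year, then site, and it only uses the last two varieties of the array above to stay short.

```javascript
// Labels: only the last two varieties are used here, to keep the sketch short.
var varieties = ["No. 475", "Wisconsin No. 38"],
    sites = ["University Farm", "Waseca", "Morris", "Crookston", "Grand Rapids", "Duluth"],
    years = [1931, 1932];

// yields[v][y][s]: variety index, then year index, then site index (assumed order).
var yields = [
  [
    [24.66667, 46.76667, 22.6, 44.1, 19.7, 33.06666],
    [30, 41.26667, 44.23333, 32.13333, 15.23333, 27.36667]
  ],
  [
    [39.3, 58.8, 29.46667, 49.86667, 34.46667, 31.6],
    [38, 58.16667, 47.16667, 35.9, 20.66667, 29.33333]
  ]
];

// Walk the three levels and attach the labels that share each index.
var records = [];
yields.forEach(function(byYear, v) {
  byYear.forEach(function(bySite, y) {
    bySite.forEach(function(value, s) {
      records.push({variety: varieties[v], site: sites[s], year: years[y], yield: value});
    });
  });
});
```

Every label comes from an index, so reordering one of the three lists without reordering yields silently mislabels the data. That is the price of the more compact form.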

Aggregating

Sometimes, what you want out of an array is not a more complex array, but a simpler list of numbers. For instance, what if you could obtain the sum of all the values in the array for a given property? This is also possible in protovis, and in fact, it looks a lot like what we’ve done. The difference is that instead of using the method entries(), we will use the method rollup().

Let’s suppose we have a flat array that looks like this. These are the scores of four students on 3 exams.

var scores=[
{student:"Adam", exam:1, score: 77},
{student:"Adam", exam:2, score: 34},
{student:"Adam", exam:3, score: 85},
{student:"Barbara", exam:1, score: 92},
{student:"Barbara", exam:2, score: 68},
{student:"Barbara", exam:3, score: 97},
{student:"Connor", exam:1, score: 84},
{student:"Connor", exam:2, score: 54},
{student:"Connor", exam:3, score: 37},
{student:"Daniela", exam:1, score: 61},
{student:"Daniela", exam:2, score: 58},
{student:"Daniela", exam:3, score: 64}
]

Now, we would like to get, in one simple object, the average for each student.
We know we could reshape the array if we wanted by using pv.nest and entries():

pv.nest(scores).key(function(d) d.student).entries()

This would be something of the shape:

[{key:"Adam", values:[
    {exam:1, score: 77, student: "Adam"},
    {exam:2, score: 34, student: "Adam"},
    {exam:3, score: 85, student: "Adam"}
    ]
  },
  {key:"Barbara", values:[
    {exam:1, score: 92, student: "Barbara"},
    {exam:2, score: 68, student: "Barbara"},
    {exam:3, score: 97, student: "Barbara"}
   ]
  },
etc.

Useful, for instance, if we wanted to chart the progress of each student separately.
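For comparison, the same grouping can be sketched without any library. This is not how protovis implements it; nestEntries is a made-up helper that only mimics the key/values shape shown above, for a single level of nesting, and the sample is cut down to two students.

```javascript
// Two students from the scores array, to keep the sketch short.
var scores = [
  {student: "Adam", exam: 1, score: 77},
  {student: "Adam", exam: 2, score: 34},
  {student: "Adam", exam: 3, score: 85},
  {student: "Barbara", exam: 1, score: 92},
  {student: "Barbara", exam: 2, score: 68},
  {student: "Barbara", exam: 3, score: 97}
];

// Hypothetical helper: group rows into [{key: ..., values: [...]}, ...],
// keeping the groups in order of first appearance.
function nestEntries(rows, keyFn) {
  var groups = [], byKey = {};
  rows.forEach(function(d) {
    var k = keyFn(d);
    if (!Object.prototype.hasOwnProperty.call(byKey, k)) {
      byKey[k] = {key: k, values: []};
      groups.push(byKey[k]);
    }
    byKey[k].values.push(d);
  });
  return groups;
}

var byStudent = nestEntries(scores, function(d) { return d.student; });
```

Here byStudent holds one object per student, each with a key and a values array containing that student’s original rows.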

Now if instead of using entries() at the end, we use rollup(), we could get this:

{
Adam: 65.33333333333333,
Barbara: 85.66666666666667,
Connor: 58.333333333333336,
Daniela: 61
}

The exact statement is

pv.nest(scores)
  .key(function(d) d.student)
  .rollup(function(data) pv.mean(data, function(d) d.score))

To understand how this works, it helps to visualize what pv.nest would have returned if we had asked for entries.
rollup goes through the values array that corresponds to each key and returns one aggregate value for it, computed by the function you pass.

For the first student, “Adam”, the corresponding values array is like this:

[
    {exam:1, score: 77, student: "Adam"},
    {exam:2, score: 34, student: "Adam"},
    {exam:3, score: 85, student: "Adam"}
    ]

So rollup will hand this array to the function.
This is what (data) in “function(data)” corresponds to.
Next, we tell protovis what to do with these elements. Here, we are interested in the average, so we take pv.mean (not pv.average, remember?).
However, we can’t directly compute the average of an array of associative arrays: we must tell protovis exactly what to average. This is why we use an accessor function, function(d) d.score.

Of course, pv.mean used in this example can be replaced by just about any function.
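The same aggregation can be sketched in plain JavaScript. The helpers rollup and mean below are written for this illustration and are not the protovis functions; they just reproduce the idea of grouping by a key and reducing each group to one number.

```javascript
var scores = [
  {student: "Adam", exam: 1, score: 77},
  {student: "Adam", exam: 2, score: 34},
  {student: "Adam", exam: 3, score: 85},
  {student: "Barbara", exam: 1, score: 92},
  {student: "Barbara", exam: 2, score: 68},
  {student: "Barbara", exam: 3, score: 97},
  {student: "Connor", exam: 1, score: 84},
  {student: "Connor", exam: 2, score: 54},
  {student: "Connor", exam: 3, score: 37},
  {student: "Daniela", exam: 1, score: 61},
  {student: "Daniela", exam: 2, score: 58},
  {student: "Daniela", exam: 3, score: 64}
];

// Hypothetical helper: average of accessor(d) over all rows.
function mean(rows, accessor) {
  var total = rows.reduce(function(sum, d) { return sum + accessor(d); }, 0);
  return total / rows.length;
}

// Hypothetical helper: group rows by key, then replace each group by one aggregate value.
function rollup(rows, keyFn, aggregate) {
  var groups = {};
  rows.forEach(function(d) {
    var k = keyFn(d);
    if (!groups[k]) groups[k] = [];
    groups[k].push(d);
  });
  var result = {};
  Object.keys(groups).forEach(function(k) {
    result[k] = aggregate(groups[k]);
  });
  return result;
}

function meanScore(rows) {
  return mean(rows, function(d) { return d.score; });
}

var avgStudent = rollup(scores, function(d) { return d.student; }, meanScore);
var avgExam = rollup(scores, function(d) { return d.exam; }, meanScore);
```

avgStudent has one property per student, and avgExam one per exam number. Swapping meanScore for another function (a sum, a maximum, a count) changes the aggregate without touching the grouping.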

In the name of clarity, especially if there is only one property that can be aggregated, you can declare a function outside of the rollup() method. This is useful if you are going to aggregate your array by different dimensions:

function meanScore(data) pv.mean(data, function(d) d.score);
var avgStudent=pv.nest(scores)
  .key(function(d) d.student)
  .rollup(meanScore);
var avgExam=pv.nest(scores)
  .key(function(d) d.exam)
  .rollup(meanScore);

Flattening

Protovis also provides methods that turn a “nested” array back into a flat array, and methods that turn a normal array into a tree.
The main advantage of having a flat array is that you can nest it in a different way. This is useful, for instance, if you got your data in a nested form that doesn’t work for you. Likewise, a tree is easier to reshape in protovis than an array.

To create a flat array out of a nested one, you have to use pv.flatten and specify all the keys of the array and conclude by array().

barley=pv.flatten(barley).key("variety").key("site").key("year").key("yield").array()

It’s important to note that you need to specify all the keys, not just the keys that correspond to actual nesting. So again, if you start from a flat array, and you do

barley=pv.nest(barley).key(function(d) d.variety).entries()

to reverse this, you’ll have to enter the full formula, using key four times:

barley=pv.flatten(barley).key("variety").key("site").key("year").key("yield").array()
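For a structure with the key/values shape that entries() produces, the reverse trip is short in plain JavaScript. The sketch below is not pv.flatten: flattenEntries is a made-up helper that handles a single level of nesting, shown here on a cut-down version of the scores example.

```javascript
// A nested structure in the key/values shape shown earlier (two students, two exams each).
var nested = [
  {key: "Adam", values: [
    {student: "Adam", exam: 1, score: 77},
    {student: "Adam", exam: 2, score: 34}
  ]},
  {key: "Barbara", values: [
    {student: "Barbara", exam: 1, score: 92},
    {student: "Barbara", exam: 2, score: 68}
  ]}
];

// Hypothetical helper: concatenate every group's values back into one flat array.
function flattenEntries(entries) {
  return entries.reduce(function(rows, group) {
    return rows.concat(group.values);
  }, []);
}

var flat = flattenEntries(nested);
```

Because each row still carries its own student property, nothing is lost when the grouping is removed, and flat can then be nested again by another property such as exam.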

Finally, pv.tree. I haven’t seen this method used outside the documentation. It’s not used in any live example, not covered by any question in the forum, and I haven’t found any trace of it on the internet. So I’d rather leave you with the explanation in the documentation, which is fairly clear, than come up with my own. If you find yourself in a situation like the one in the documentation, where you have a flat array of associative arrays with a property that could be interpreted as a hierarchy, then you could use this method to turn your array into something more useful to protovis.

Putting it all together

Instead of coming up with a specific new example for this section, I refer you to my explanation of the Becker’s Barley example.
On the same subject, see a comparison of how to re-create Becker’s Barley with protovis and Tableau.

next: working with layouts

barley=pv.nest(barley) .key(function(d) d.variety) .entries();	barley=pv.nest(barley) .key(function(d) d.variety) .sortKeys() .entries();