19.2 The build step

ggplot_build(), as discussed above, takes the declarative representation constructed with the public API and augments it by preparing the data for conversion to graphic primitives.

19.2.1 Data preparation

The first part of the processing is to get the data associated with each layer and get it into a predictable format. A layer can provide data in one of three ways: it can supply its own (e.g., if the data argument to a geom is a data frame), it can inherit the global data supplied to ggplot(), or it can provide a function that returns a data frame when applied to the global data. In all three cases the result is a data frame that is passed to the plot layout, which orchestrates coordinate systems and facets. When this happens the data is first passed to the plot coordinate system, which may change it (but usually doesn’t), and then to the facet, which inspects the data to figure out how many panels the plot should have and how they should be organised. During this process the data associated with each layer is augmented with a PANEL column. This column must be kept throughout the rendering process and is used to link each data row to a specific facet panel in the final plot.
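The three ways a layer can obtain its data can be sketched as follows. This is a minimal illustration that assumes the mpg data set bundled with ggplot2. The object names (base, inherited, own, derived) are made up for the example, and layer_data() is used only as a convenient way to look at what each layer ends up with once the plot has been built.

```r
library(ggplot2)

# Global data supplied to ggplot(); layers inherit it unless told otherwise.
base <- ggplot(mpg, aes(displ, hwy))

# 1. The layer inherits the global data.
inherited <- base + geom_point()

# 2. The layer supplies its own data frame.
own <- base + geom_point(data = head(mpg, 20))

# 3. The layer supplies a function that is applied to the global data.
derived <- base + geom_point(data = function(d) d[d$cyl == 4, ])

# layer_data() builds the plot and returns the data for one layer.
nrow(layer_data(inherited))
nrow(layer_data(own))
nrow(layer_data(derived))
```

The row counts differ because each layer started from a different data frame, even though all three share the same aesthetic mapping.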

The last part of the data preparation is to convert the layer data into calculated aesthetic values. This involves evaluating all aesthetic expressions from aes() on the layer data. Further, if not given explicitly, the group aesthetic is calculated from the interaction of all non-continuous aesthetics. The group aesthetic is, like PANEL, a special column that must be kept throughout the processing. As an example, the plot p created earlier contains only the one layer specified by geom_point(), and at the end of the data preparation process the first 10 rows of the data associated with this layer look like this:

#>      x  y colour PANEL group
#> 1  1.8 29      f     1     2
#> 2  1.8 29      f     1     2
#> 3  2.0 31      f     2     2
#> 4  2.0 30      f     2     2
#> 5  2.8 26      f     1     2
#> 6  2.8 26      f     1     2
#> 7  3.1 27      f     2     2
#> 8  1.8 26      4     1     1
#> 9  1.8 25      4     1     1
#> 10 2.0 28      4     2     1
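The PANEL and group columns survive all the way to the end of the build, so they can be inspected on any built plot. The sketch below assumes a simplified stand-in for p (called q here, a hypothetical name) with one discrete aesthetic and two facet panels. The last lines are only a rough base R analogue of the default grouping, not ggplot2's internal code.

```r
library(ggplot2)

# Hypothetical plot: one discrete aesthetic (colour) and two facet panels.
q <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  facet_wrap(vars(year))

d <- layer_data(q, 1)

# Both bookkeeping columns are present in the built layer data.
c("PANEL", "group") %in% names(d)

# Cross-tabulate panels (one per year) against groups (one per drv value).
table(PANEL = d$PANEL, group = d$group)

# Rough analogue of the default grouping: the interaction of all discrete
# aesthetics (here only drv), converted to integer codes.
approx_group <- as.integer(interaction(mpg$drv, drop = TRUE))
table(approx_group)
```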

19.2.2 Data transformation

Once the layer data has been extracted and converted to a predictable format it undergoes a series of transformations until it has the format expected by the layer geometry.

The first step is to apply any scale transformations to the columns in the data. It is at this stage of the process that any argument to trans in a scale has an effect, and all subsequent rendering will take place in this transformed space. This is the reason why setting a position transform in the scale has a different effect than setting it in the coordinate system. If the transformation is specified in the scale it is applied before any other calculations, but if it is specified in the coordinate system the transformation is applied after those calculations. For instance, our original plot p involves no scale transformations so the layer data remain untouched at this stage. The first three rows are shown below:

#>     x  y colour PANEL group
#> 1 1.8 29      f     1     2
#> 2 1.8 29      f     1     2
#> 3 2.0 31      f     2     2

In contrast, if our plot object is p + scale_x_log10() and we inspect the layer data at this point in processing, we see that the x variable has been transformed appropriately:

#>       x  y colour PANEL group
#> 1 0.255 29      f     1     2
#> 2 0.255 29      f     1     2
#> 3 0.301 31      f     2     2
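The difference between transforming in the scale and transforming in the coordinate system shows up directly in the built data. The sketch below uses a plain scatterplot without jitter so that the values are reproducible. It assumes coord_trans() is available in the installed ggplot2 (later releases may prefer a different name for it), and the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Transformation in the scale: applied before any other computation, so the
# built data are already on the log10 scale.
scale_x <- layer_data(base + scale_x_log10())$x

# Transformation in the coordinate system: applied after those computations,
# so the built data still hold the original values.
coord_x <- layer_data(base + coord_trans(x = "log10"))$x

head(scale_x, 3)
head(coord_x, 3)
```

Any stat or position adjustment placed between these two points would therefore see log-transformed values in the first case and raw values in the second.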

The second step in the process is to map the position aesthetics using the position scales, which unfolds differently depending on the kind of scale involved. For continuous position scales – such as those used in our example – the out-of-bounds function specified in the oob argument (Section 9.1.1) is applied at this point, and NA values in the layer data are removed. This makes little difference for p, but if we were plotting p + xlim(2, 8) instead, the oob function – scales::censor() in this case – would replace x values below 2 with NA, as illustrated below:

#> Warning: Removed 22 rows containing non-finite values (stat_smooth).
#>    x  y colour PANEL group
#> 1 NA 29      f     1     2
#> 2 NA 29      f     1     2
#> 3  2 31      f     2     2
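The effect of the out-of-bounds function can be reproduced outside a plot by calling it directly on a vector. The sketch below applies scales::censor() to the first three x values with the limits from xlim(2, 8), and contrasts it with scales::squish(), an alternative oob function that clamps values to the limits instead of dropping them. The vector x is made up for the example.

```r
# First three displ values from the example data.
x <- c(1.8, 1.8, 2.0)

# censor() replaces values outside the range with NA.
scales::censor(x, range = c(2, 8))
#> [1] NA NA  2

# squish() moves values outside the range to the nearest limit.
scales::squish(x, range = c(2, 8))
#> [1] 2 2 2
```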

For discrete positions the change is more radical, because the values are matched to the limits values or the breaks specification provided by the user, and then converted to integer-valued positions. Finally, for binned position scales the continuous data is first cut into bins using the breaks argument, and the position for each bin is set to the midpoint of its range. The reason for performing the mapping at this stage of the process is consistency: no matter what type of position scale is used, it will look continuous to the stat and geom computations. This is important because otherwise computations such as dodging and jitter would fail for discrete scales.
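The discrete and binned cases can be imitated in a few lines of base R. This is a toy sketch of the idea only, not ggplot2's implementation: the limits, breaks and input vectors are invented, and the real scales handle details such as missing values and limits that are ignored here.

```r
# Discrete: values are matched against the limits and replaced by
# integer positions.
limits <- c("4", "f", "r")
drv <- c("f", "f", "4", "r")
match(drv, limits)
#> [1] 2 2 1 3

# Binned: values are cut at the breaks and each one is placed at the
# midpoint of the bin it falls into.
breaks <- c(1, 3, 5, 7)
displ <- c(1.8, 2.8, 3.1, 5.7)
bin <- findInterval(displ, breaks)
(breaks[bin] + breaks[bin + 1]) / 2
#> [1] 2 2 4 6
```

In both cases the result is an ordinary numeric vector, which is what allows later steps such as dodging and jittering to treat every position scale alike.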

At the third stage in this transformation the data is handed to the layer stat, where any statistical transformation takes place. The procedure is as follows: first, the stat is allowed to inspect the data and modify its parameters, and then to do a one-off preparation of the data. Next, the layer data is split by PANEL and group, and statistics are calculated before the data is reassembled.⁴⁷ Once the data has been reassembled in its new form it goes through another aesthetic mapping process. This is where any aesthetics whose computation has been delayed using stat() (or the old ..var.. notation) get added to the data. Notice that this is why stat() expressions – including the formula used to specify the regression model in the geom_smooth() layer of our example plot p – cannot refer to the original data. It simply doesn’t exist at this point.
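A delayed aesthetic can be seen at work in a small histogram. The sketch below assumes a ggplot2 version that provides after_stat(), the newer spelling of stat(). The plot h is a hypothetical example unrelated to p.

```r
library(ggplot2)

# The y aesthetic is not evaluated on the original data. It is filled in
# after the stat has run, from the density column the stat computed.
h <- ggplot(mpg, aes(hwy)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10)

d <- layer_data(h)

# y is a copy of the stat's density column.
all.equal(d$y, d$density)

# Densities multiplied by bin widths sum to one.
sum(d$y * (d$xmax - d$xmin))
```

Note that the built data contain one row per bin, not one row per observation: the columns of mpg other than those computed by the stat are no longer available.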

As an example consider the second layer in our plot, which produces the linear regressions. Before the stat computations have been performed the data for this layer simply contain the coordinates and the required PANEL and group columns.

#>     x  y colour PANEL group
#> 1 1.8 29      f     1     2
#> 2 1.8 29      f     1     2
#> 3 2.0 31      f     2     2

After the stat computations have taken place, the layer data are changed considerably:

#>      x    y ymin ymax    se flipped_aes colour PANEL group
#> 1 1.80 24.3 23.1 25.6 0.625       FALSE      4     1     1
#> 2 1.86 24.2 22.9 25.4 0.612       FALSE      4     1     1
#> 3 1.92 24.0 22.8 25.2 0.598       FALSE      4     1     1
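The split, compute and reassemble pattern that produces output like this can be imitated in base R. The sketch below is a toy analogue under simplifying assumptions: the data frame layer and the helper compute_group() are hypothetical, and the helper fits a plain lm() without standard errors, so it is much simpler than what the smoothing stat does.

```r
# Toy layer data: two groups within a single panel.
layer <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 6, 3, 3, 3),
  PANEL = factor(rep(1, 6)),
  group = c(1, 1, 1, 2, 2, 2)
)

# Hypothetical stand-in for a stat's per-group computation: fit y ~ x and
# predict on a grid spanning the x range of the group.
compute_group <- function(d, n = 3) {
  fit <- lm(y ~ x, data = d)
  grid <- data.frame(x = seq(min(d$x), max(d$x), length.out = n))
  grid$y <- unname(predict(fit, newdata = grid))
  grid$PANEL <- d$PANEL[1]
  grid$group <- d$group[1]
  grid
}

# Split by PANEL and group, compute per piece, then reassemble.
pieces <- split(layer, list(layer$PANEL, layer$group), drop = TRUE)
result <- do.call(rbind, lapply(pieces, compute_group))
rownames(result) <- NULL
result
```

As in the real build, the reassembled data have a different shape from the input, but the PANEL and group columns are carried through.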

At this point the geom takes over from the stat (almost). The first action it takes is to inspect the data, update its parameters and possibly make a first-pass modification of the data (the same setup as for the stat). This is where some of the columns may get reparameterised, e.g. x+width gets changed to xmin+xmax. After this the position adjustment gets applied, so that, for example, overlapping bars are stacked. For our example plot p, it is at this step that the jittering is applied in the first layer of the plot and the x and y coordinates are perturbed:

#>      x    y colour PANEL group
#> 1 1.79 29.1      f     1     2
#> 2 1.79 29.3      f     1     2
#> 3 1.99 30.7      f     2     2
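For a bar-like geom, the reparameterisation and a stacking adjustment can be mimicked in base R. This is a toy sketch with made-up data (the bars data frame), and it ignores details of the real position adjustment such as the order in which groups are stacked.

```r
# Toy bar data: two bars share x = 1 and therefore overlap.
bars <- data.frame(x = c(1, 1, 2), y = c(3, 2, 4), width = 0.9)

# Reparameterise x + width as xmin + xmax.
bars$xmin <- bars$x - bars$width / 2
bars$xmax <- bars$x + bars$width / 2

# Stack overlapping bars: each bar starts where the previous one at the
# same x position ended.
bars$ymax <- ave(bars$y, bars$x, FUN = cumsum)
bars$ymin <- bars$ymax - bars$y

bars[, c("xmin", "xmax", "ymin", "ymax")]
```

After stacking, the largest ymax (5) exceeds the largest original y (4), so a scale trained only on y would be too narrow.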

Next – and perhaps surprisingly – the position scales are all reset, retrained, and applied to the layer data. On reflection, this is necessary because, for example, stacking can change the range of one of the axes dramatically. In some cases (e.g., in the histogram example above) one of the position aesthetics may not even be available until after the stat computations, and if the scales were not retrained it would never get trained.
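A stacked bar chart makes the need for retraining concrete. In the sketch below (a hypothetical plot named bars, unrelated to p), the y aesthetic does not exist until the counting stat has run, and stacking then pushes the top of the tallest column above the largest single count. The call to get_limits() on the scale returned by layer_scales() is assumed to report the trained range of the y scale.

```r
library(ggplot2)

bars <- ggplot(mpg, aes(class, fill = drv)) + geom_bar()
d <- layer_data(bars)

# Largest count computed by the stat for a single bar segment.
max(d$count)

# Top of the tallest stack after the position adjustment.
max(d$ymax)

# Trained limits of the y scale, which must cover the stacked extent.
layer_scales(bars)$y$get_limits()
```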

The last part of the data transformation is to train and map all non-positional aesthetics, i.e. to convert whatever discrete or continuous input is mapped into graphical parameters such as colours, linetypes and sizes. Further, any default aesthetics from the geom are added so that the data is now in a predictable state for the geom. At the very last step, both the stat and the facet get a last chance to modify the data in its final mapped form with their finish_data() methods before the build step is done. For the plot object p, the first few rows from the final state of the layer data look like this:

#>    colour    x    y PANEL group shape size fill alpha stroke
#> 1 #00BA38 1.76 29.0     1     2    19  1.5   NA    NA    0.5
#> 2 #00BA38 1.83 29.0     1     2    19  1.5   NA    NA    0.5
#> 3 #00BA38 1.98 31.3     2     2    19  1.5   NA    NA    0.5
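The final state of a layer is what layer_data() returns, so the mapped and default aesthetics can be checked on any plot. The sketch below uses a simplified stand-in for p without jitter or facets (the name q is hypothetical). The exact default values depend on the installed ggplot2 version.

```r
library(ggplot2)

q <- ggplot(mpg, aes(displ, hwy, colour = drv)) + geom_point()
d <- layer_data(q)

# Columns that were added from the geom's defaults instead of being mapped.
setdiff(names(d), c("x", "y", "colour", "PANEL", "group"))

# The discrete drv values have been mapped to colour strings.
unique(d$colour)

# The defaults themselves are stored on the geom.
names(GeomPoint$default_aes)
```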

19.2.3 Output

The return value of ggplot_build() is a list structure with the ggplot_built class. It contains the computed data, as well as a Layout object holding information about the trained coordinate system and faceting. Further, it holds a copy of the original plot object, but now with trained scales.
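The built object can be explored interactively. The definition of p below is an assumed reconstruction of the expository plot based on the layers described in this section (jittered points, per-group linear regressions, and two facet panels), so the faceting variable and other details may differ from the original. How the object prints and which class names it reports can also vary between ggplot2 versions.

```r
library(ggplot2)

# Assumed reconstruction of the example plot p.
p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(vars(year))

built <- ggplot_build(p)

class(built)

# One data frame of computed data per layer.
length(built$data)

# First rows of the regression layer after the build.
head(built$data[[2]], 3)
```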

⁴⁷ It is possible for a stat to circumvent this splitting by overriding specific compute_*() methods and thus do some optimisation.