Every layer must have some data associated with it, and that data must be in a tidy data frame. Tidy data frames are described in more detail in R for Data Science (https://r4ds.had.co.nz), but for now, all you need to know is that a tidy data frame has variables in the columns and observations in the rows. This is a strong restriction, but there are good reasons for it:
Your data is very important, so it’s best to be explicit about it.
A single data frame is also easier to save than a multitude of vectors, which means it’s easier to reproduce your results or send your data to someone else.
It enforces a clean separation of concerns: ggplot2 turns data frames into visualisations. Other packages can make data frames in the right format.
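To make the second point concrete, here is a minimal sketch contrasting loose vectors with a single tidy data frame. The values and the names engine_size, highway_mpg and cars_df are made up for illustration and are not part of the mpg dataset.

```r
library(ggplot2)

# Hypothetical measurements held as two separate vectors
engine_size <- c(1.8, 2.0, 2.8, 3.1)
highway_mpg <- c(29, 31, 26, 27)

# The same values as one tidy data frame:
# one column per variable, one row per observation
cars_df <- data.frame(displ = engine_size, hwy = highway_mpg)

# The plot names its data explicitly
p <- ggplot(cars_df, aes(displ, hwy)) + geom_point()

# A single object is all that needs to be saved or shared
path <- tempfile(fileext = ".csv")
write.csv(cars_df, path, row.names = FALSE)
restored <- read.csv(path)
```

Reading the file back gives a data frame with the same columns and rows, so the plot can be rebuilt from the file alone.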
The data on each layer doesn’t need to be the same, and it’s often useful to combine multiple datasets in a single plot. To illustrate that idea I’m going to generate two new datasets related to the mpg dataset. First I’ll fit a loess model and generate predictions from it. (This is what geom_smooth() does behind the scenes.)
mod <- loess(hwy ~ displ, data = mpg)
# tibble() replaces the deprecated data_frame()
grid <- tibble(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy <- predict(mod, newdata = grid)

grid
#> # A tibble: 50 x 2
#>   displ   hwy
#>   <dbl> <dbl>
#> 1  1.6   33.1
#> 2  1.71  32.2
#> 3  1.82  31.3
#> 4  1.93  30.4
#> 5  2.04  29.6
#> 6  2.15  28.8
#> # … with 44 more rows
Next, I’ll isolate observations that are particularly far away from their predicted values:
# Residuals scaled by the residual standard error of the loess fit
std_resid <- resid(mod) / mod$s
outlier <- filter(mpg, abs(std_resid) > 2)

outlier
#> # A tibble: 6 x 11
#>   manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class
#>   <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
#> 1 chevrolet    corvet…   5.7  1999     8 manual… r        16    26 p     2seater
#> 2 pontiac      grand …   3.8  2008     6 auto(l… f        18    28 r     midsize
#> 3 pontiac      grand …   5.3  2008     8 auto(s… f        16    25 p     midsize
#> 4 volkswagen   jetta     1.9  1999     4 manual… f        33    44 d     compact
#> 5 volkswagen   new be…   1.9  1999     4 manual… f        35    44 d     subcom…
#> 6 volkswagen   new be…   1.9  1999     4 auto(l… f        29    41 d     subcom…
I’ve generated these datasets because it’s common to enhance the display of raw data with a statistical summary and some annotations. With these new datasets, I can improve our initial scatterplot by overlaying a smoothed line, and labelling the outlying points:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line(data = grid, colour = "blue", size = 1.5) +
  geom_text(data = outlier, aes(label = model))
(The labels aren’t particularly easy to read, but you can fix that with some manual tweaking.)
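As one possible tweak, the sketch below shifts each label slightly to the right of its point and drops any label that would overlap an earlier one. It rebuilds grid and outlier so that it runs on its own. The particular settings (hjust = "left", nudge_x = 0.05, check_overlap = TRUE) are illustrative assumptions, not the only or the recommended fix, and the line width is left at its default.

```r
library(ggplot2)
library(dplyr)

# Rebuild the two helper datasets used in the text
mod <- loess(hwy ~ displ, data = mpg)
grid <- tibble(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy <- predict(mod, newdata = grid)
outlier <- filter(mpg, abs(resid(mod) / mod$s) > 2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_line(data = grid, colour = "blue") +
  geom_text(
    data = outlier, aes(label = model),
    hjust = "left",        # anchor the text at its left edge
    nudge_x = 0.05,        # move labels slightly right of the points
    check_overlap = TRUE   # skip labels that would overlap earlier ones
  )
```

Printing p draws the plot. Because check_overlap silently drops labels, it suits exploration better than a final figure in which every outlier must be named.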
Note that you need the explicit data = in the layers, but not in the call to ggplot(). That’s because the argument order is different. This is a little inconsistent, but it reduces typing for the common case where you specify the data once in ggplot() and modify aesthetics in each layer.
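The difference in argument order can be checked directly. This small sketch uses base R's formals() to list the first two arguments of ggplot() and of one layer function. geom_point() is used here only as a representative layer.

```r
library(ggplot2)

# First two formal arguments of the plot constructor
plot_args <- names(formals(ggplot))[1:2]
plot_args
#> [1] "data"    "mapping"

# First two formal arguments of a layer function
layer_args <- names(formals(geom_point))[1:2]
layer_args
#> [1] "mapping" "data"
```

So a data frame passed by position to ggplot() is matched to data, whereas the same positional argument in a layer would be matched to mapping, which is why layers need data = spelled out.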
In this example, every layer uses a different dataset. We could define the same plot in another way, omitting the default dataset, and specifying a dataset for each layer:
ggplot(mapping = aes(displ, hwy)) +
  geom_point(data = mpg) +
  geom_line(data = grid) +
  geom_text(data = outlier, aes(label = model))
I don’t particularly like this style in this example because it makes it less clear what the primary dataset is (and because of the way that the arguments to ggplot() are ordered, it actually requires more keystrokes). However, you may prefer it in cases where there isn’t a clear primary dataset, or where the aesthetics also vary from layer to layer.
The first two arguments to ggplot() are data and mapping. The first two arguments to all layer functions are mapping and data. Why does the order of the arguments differ? (Hint: think about what you set most commonly.)
The following code uses dplyr to generate some summary statistics about each class of car.
library(dplyr)
class <- mpg %>%
  group_by(class) %>%
  summarise(n = n(), hwy = mean(hwy))
#> `summarise()` ungrouping output (override with `.groups` argument)
Use the data to recreate this plot: