Leave-one-out subset for your ggplot smoother

I have a dashboard at work that plots the number of contracts my office handles over time, and I wanted to add a trendline to show growth. However, a trendline fit to the entire dataset skews low because our fiscal year just restarted, so the most recent year’s count is still incomplete. It took a little trial and error to figure out how to exclude the most recent year’s count so the trendline is more accurate for this purpose, so I thought I’d share my solution in a short post.

Here’s the code with a dummy dataset, with both the original trendline and the leave-one-out version:

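library(dplyr)
library(ggplot2)

x <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
)

x %>%
  arrange(desc(x)) %>%  # most recent period first
  ggplot(aes(x = x, y = y)) +
  geom_col() +
  # red: trendline fit to every row
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  # default blue: trendline fit after dropping the first (most recent) row
  geom_smooth(method = "lm", data = function(d) slice(d, -1), se = FALSE, fullrange = TRUE)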

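The magic happens by defining an anonymous function as the data argument for the geom_smooth function. ggplot2 calls that function with the plot’s data and uses the data frame it returns for that layer only. I’m using the slice function to drop the first row of the data frame provided to geom_smooth.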
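
To see what the anonymous function changes, here is a minimal sketch that applies the same slice call outside of ggplot and fits both linear models by hand. The names contracts, trimmed, fit_all and fit_trimmed are placeholders I made up for this illustration, not part of the dashboard code.

```r
library(dplyr)

# Same dummy data as above, sorted so the most recent period is the first row
contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
) %>%
  arrange(desc(x))

# What the anonymous function hands to geom_smooth: every row but the first
trimmed <- slice(contracts, -1)

# Fit the same straight-line model to both versions of the data
fit_all <- lm(y ~ x, data = contracts)
fit_trimmed <- lm(y ~ x, data = trimmed)

# Compare intercepts and slopes; the partial final period pulls the full fit down
coef(fit_all)
coef(fit_trimmed)
```

The slope of the trimmed fit is steeper than the slope of the full fit, which is the gap between the red and blue lines in the plot.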

There are two other important tips here. First, be sure to sort the data in the tidyverse pipe before passing it to ggplot. For my purpose, this was in descending date order because I wanted to drop the most recent entry. The other tip is an aesthetic one: use the fullrange = TRUE argument to extend the trendline into the current date column, which provides a rough prediction for the current date period.
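
If you want to check the effect of fullrange without eyeballing the plot, a rough sketch like the following compares the x range of the fitted line with the argument off and on. The helper name trend_plot and the data frame name contracts are my own placeholders, and I am assuming layer 2 is the smoother because it is the second layer added.

```r
library(dplyr)
library(ggplot2)

# Same dummy data as above, sorted so the most recent period is the first row
contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
) %>%
  arrange(desc(x))

# Hypothetical helper that builds the plot with fullrange switched on or off
trend_plot <- function(fullrange) {
  ggplot(contracts, aes(x = x, y = y)) +
    geom_col() +
    geom_smooth(
      method = "lm", formula = y ~ x, se = FALSE,
      data = function(d) slice(d, -1),
      fullrange = fullrange
    )
}

# layer_data() returns the computed data for one layer; layer 2 is the smoother
range(layer_data(trend_plot(FALSE), 2)$x)  # stops at the last fitted period
range(layer_data(trend_plot(TRUE), 2)$x)   # extends across the whole x axis
```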

NOTE: I’ve seen some commentary that the decision to omit elements as described here should be explicitly disclosed. However, I think it’s fairly apparent in this specific application that the extension of the smoother is predictive in nature, rather than a true representation of the last data element. I could probably shade or apply an alpha to the most recent data point using a similar technique to make this even more clear, but the purpose here was to demonstrate the leave-one-out technique.
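
For what it’s worth, here is one way the shading idea might look, reusing the function-as-data trick to draw the complete periods and the partial period as separate layers. This is only a sketch: the name contracts and the alpha value of 0.4 are arbitrary choices of mine, and I have not tried it on the real dashboard.

```r
library(dplyr)
library(ggplot2)

# Same dummy data as above, sorted so the most recent period is the first row
contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
) %>%
  arrange(desc(x))

p <- ggplot(contracts, aes(x = x, y = y)) +
  # Complete periods: everything except the first (most recent) row
  geom_col(data = function(d) slice(d, -1)) +
  # Partial period: only the first row, drawn faded
  geom_col(data = function(d) slice(d, 1), alpha = 0.4) +
  # Trendline fit to the complete periods and extended over the faded bar
  geom_smooth(
    method = "lm", formula = y ~ x, se = FALSE,
    data = function(d) slice(d, -1),
    fullrange = TRUE
  )

p
```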

What do you think of this solution?

3 Comments

  1. Michael

    Hi Nathan

    I think that the R code is neat, but the visualization is misleading.

    The bars give the visual impression that they represent numbers, but actually they represent rates: events per unit of time. The units of time are identical for all the bars except the last one, which should be shown with a narrower width. For example, if the units of time are years and the last count is for 3 months, the last bar should be 3/12ths the width of the others and 12/3 the height of what is showing now (28 instead of the 7 in your example).

    • You’re right about what the data is intended to represent, and thank you for the interesting solution for this example. In real-world use, there are more than 10 bars, and altering the width of the last bar doesn’t convey the period information in a pre-attentive fashion.

      Also, I’m using bars instead of other geoms so I can color by type and stack, which eliminates some other options like geom_line or geom_area.

      Another solution to the misleading-bar problem that I happened to see is the traffic stats view on WordPress mobile, which shows a projection for the remainder of the period in a faded color.

  2. Aleem Juma

    Providing a function to ggplot as data is interesting, but I’m not sure it has a place in this particular use case, since subsetting is much more efficiently done with the standard square-brackets approach, i.e.

    data = x[x$x != max(x$x), ]

    Also if you’re plotting time periods your time axis should only include *complete* periods. If you want to show current YTD growth you can do so by comparing this month to the same month in prior years. The visual should make it easy for the consumer to compare like for like.
