Leave-one-out subset for your ggplot smoother

I have a dashboard at work that plots the number of contracts my office handles over time, and I wanted to add a trendline to show growth. However, trendline on the entire dataset skews low because our fiscal year just restarted. It took a little trial and error to figure out how to exclude the most recent year’s count so the trendline is more accurate for this purpose, so I thought I’d share my solution in a short post.

Here’s the code with a dummy dataset, with both the original trendline and the leave-one-out version:

x <- data.frame(
  x = seq(1:10),
  y = c(1,1,2,3,5,8,13,21,34,7)
)

x %>%
  arrange(desc(x)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_col() + 
  geom_smooth(method = "lm", se = F, color = "red") +
  geom_smooth(method = "lm", data = function(x) { slice(x, -1) }, se = F, fullrange = TRUE)

The magic happens by defining an anonymous function as the data argument for the geom_smooth function. I’m using the slice function to drop the first element of the data frame provided to geom_smooth.

There are two other important tips here. First, be sure and sort the data in the tidyverse pipe before passing it to ggplot — for my purpose, this was in descending date order because I wanted to drop the most recent entry. The other tip is an aesthetic one, which is to use the fullrange = TRUE argument in order to plot the trendline into the current date column to provide a rough prediction for the current date period.

NOTE: I’ve seen some commentary that the decision to omit elements as described here should be explicitly disclosed. However, I think it’s fairly apparent in this specific application that the extension of the smoother is predictive in nature, rather than a true representation of the last data element. I could probably shade or apply an alpha to the most recent data point using a similar technique to make this even more clear, but the purpose here was to demonstrate the leave-one-out technique.

What do you think of this solution?

Back to Top