I have a dashboard at work that plots the number of contracts my office handles over time, and I wanted to add a trendline to show growth. However, a trendline fit to the entire dataset skews low because our fiscal year just restarted, so the most recent year’s count is still incomplete. It took a little trial and error to figure out how to exclude that count so the trendline is more accurate for this purpose, so I thought I’d share my solution in a short post.
Here’s the code, using a dummy dataset, showing both the original trendline (in red) and the leave-one-out version:
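library(dplyr)
library(ggplot2)

# Dummy data: ten periods, the last of which (x = 10) is only partially complete
contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
)

contracts %>%
  arrange(desc(x)) %>%  # most recent period first, so slice(-1) drops it
  ggplot(aes(x = x, y = y)) +
  geom_col() +
  # Red: trendline fit to every row
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  # Default color: trendline fit without the first (most recent) row
  geom_smooth(method = "lm", data = function(d) slice(d, -1), se = FALSE, fullrange = TRUE)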
The magic happens by defining an anonymous function as the data argument for geom_smooth. When data is a function, ggplot2 calls it with the plot’s data and uses the data frame it returns for that layer. I’m using the slice function to drop the first row of the data frame provided to geom_smooth.
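To see what the two smoothers are actually fitting, here is a minimal sketch that runs the same linear models directly with lm(), outside of ggplot. It assumes the dummy data from above (the name contracts is only illustrative) and that geom_smooth is using its default y ~ x formula for method = "lm".

```r
contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
)

# Fit on every period, including the partial current one (the red line)
fit_all <- lm(y ~ x, data = contracts)

# Fit on completed periods only (the leave-one-out line)
fit_trimmed <- lm(y ~ x, data = contracts[contracts$x < 10, ])

# Compare slopes: the partial period drags the full fit down
coef(fit_all)[["x"]]
coef(fit_trimmed)[["x"]]

# Value of the trimmed fit at the excluded period (x = 10)
predict(fit_trimmed, newdata = data.frame(x = 10))
```

The trimmed model has the steeper slope, which is why the second trendline sits above the red one at the right-hand side of the chart.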
There are two other important tips here. First, be sure to sort the data in the tidyverse pipe before passing it to ggplot. For my purpose, this was descending date order, because I wanted to drop the most recent entry and slice(-1) removes whichever row comes first. The other tip is an aesthetic one: use the fullrange = TRUE argument to extend the trendline into the current period’s column, which provides a rough prediction for that period.
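A quick way to convince yourself that the sort order matters is to run slice(-1) on the dummy data with and without the arrange step. This is only an illustrative check, and the object names are made up for the example.

```r
library(dplyr)

contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
)

# Ascending order (as created): slice(-1) removes the OLDEST period, x = 1
dropped_oldest <- contracts %>% slice(-1)

# Descending order: slice(-1) removes the MOST RECENT period, x = 10
dropped_latest <- contracts %>% arrange(desc(x)) %>% slice(-1)

range(dropped_oldest$x)  # 2 10
range(dropped_latest$x)  # 1 9
```

Without the sort, the trendline would silently exclude the wrong period.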
NOTE: I’ve seen some commentary that the decision to omit elements as described here should be explicitly disclosed. However, I think it’s fairly apparent in this specific application that the extension of the smoother is predictive in nature, rather than a true representation of the last data point. I could probably shade or apply an alpha to the most recent data point using a similar technique to make this even clearer, but the purpose here was to demonstrate the leave-one-out technique.
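For what it’s worth, here is a rough sketch of how that shading idea might look, reusing the same function-as-data trick for the bars. The alpha value of 0.4 and the split into two geom_col layers are assumptions made for illustration, not something from the original dashboard.

```r
library(dplyr)
library(ggplot2)

contracts <- data.frame(
  x = 1:10,
  y = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 7)
)

p <- contracts %>%
  arrange(desc(x)) %>%
  ggplot(aes(x = x, y = y)) +
  # Completed periods at full opacity
  geom_col(data = function(d) slice(d, -1)) +
  # Most recent (partial) period, faded; 0.4 is an arbitrary choice
  geom_col(data = function(d) slice(d, 1), alpha = 0.4) +
  # Trendline fit to completed periods only, extended across the full x range
  geom_smooth(method = "lm", formula = y ~ x,
              data = function(d) slice(d, -1),
              se = FALSE, fullrange = TRUE)

p
```

The faded bar signals that the latest count is incomplete, while the extended line still shows where the trend would put it.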
What do you think of this solution?