Is GINA really about to die?!?

Introduction

During a recent negotiation of an informed consent form for use in a clinical trial, the opposing lawyer and I skirmished over the applicability of the Genetic Information Nondiscrimination Act of 2008, commonly known as GINA. Specifically, the opposing lawyer thought that guidance issued by the U.S. Office for Human Research Protections in 2009 was now outdated, in part because enforcement efforts were erratic. The argument was primarily driven by policy, rather than data.

Being a data-driven guy, I wanted to see whether the data supported the argument advanced by the other lawyer. Fortunately, the U.S. Equal Employment Opportunity Commission (EEOC), which is responsible for administering GINA complaints, maintains statistics regarding GINA claims and resolutions. I’m not great at making sense of numbers in a table, so I thought this presented the perfect opportunity to rvest some data!

libraries <- c("tidyverse", "rvest", "magrittr")
lapply(libraries, require, character.only = TRUE)

Data Scraping

In standard rvest fashion, we’ll read a URL, extract the table containing the GINA enforcement statistics, and then do some data cleaning. Once we read the table and gather all of the annual results into year/value pairs, we get the following results:

url <- "https://www.eeoc.gov/eeoc/statistics/enforcement/genetic.cfm"

GINA.resolutions <- read_html(url) %>%
  html_nodes("table") %>% 
  extract2(1) %>%
  html_table(trim = TRUE, fill = TRUE, header = TRUE) 

names(GINA.resolutions)[1] <- "metric" # Top left table cell is blank, will throw errors
names(GINA.resolutions) <- gsub("FY (.+)", "\\1", names(GINA.resolutions)) # Remove FY from year so we can convert to numeric

GINA.resolutions <- GINA.resolutions %>%
  filter(! metric == "") %>% # Remove percentage rows
  filter(! metric == "Resolutions By Type") %>% # Remove blank line
  gather(year, value, 2:9) %>% # short and wide data to tall and skinny
  mutate(
    year = as.integer(year),
    value = gsub("[\\$\\%]", "", value)
  ) %>%
  mutate(
    value = as.numeric(value)
  ) %>%
  as_tibble()

GINA.resolutions
## # A tibble: 88 x 3
##    metric                      year value
##    <chr>                      <int> <dbl>
##  1 Receipts                    2010   201
##  2 Resolutions                 2010    56
##  3 Settlements                 2010     3
##  4 Withdrawals w/Benefits      2010     2
##  5 Administrative Closures     2010    11
##  6 No Reasonable Cause         2010    38
##  7 Reasonable Cause            2010     2
##  8 Successful Conciliations    2010     1
##  9 Unsuccessful Conciliations  2010     1
## 10 Merit Resolutions           2010     7
## # ... with 78 more rows

Claim Numbers over Time

Now that we have the data in a format we can use, we’ll look at the volume of claims and resolutions over time:

GINA.resolutions %>%
  filter(metric == "Receipts" | metric == "Resolutions") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Claims and Resolutions, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
  scale_color_brewer("", palette = "Paired")

The number of GINA claims rose for the first few years, but then declined to enactment-year levels. This could support the opposing lawyer’s argument that enforcement is waning. However, it could just as easily show that the law’s deterrent effect has worked, and that most firms subject to GINA are now aware of the law and have taken appropriate steps toward compliance. Given these competing interpretations, we’ll need to look a little deeper to see if we can derive trends from the data.

GINA Claim Resolutions

One of the arguments made by the opposing lawyer is that the Obama administration was pushing GINA enforcement, and that the Trump administration hates the law and won’t enforce it. We can look at the resolution types to test this hypothesis:

GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions") %>%
  ggplot(aes(year, value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  )

In 2017, the first year of the Trump administration, administrative closures were down and resolutions on the merits were up, which contradicts the opposing lawyer’s argument. While findings of no reasonable cause were up about 10-15%, findings of reasonable cause were up 50%; if anything, this also contradicts the opposing lawyer’s argument. Monetary settlements appear to be relatively flat from 2013 – 2017, and in any event a million dollars isn’t a ton of money in light of the EEOC’s annual budget of about $376 million (note that the EEOC handles many other types of charges besides GINA).
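Those percentage figures are easy to verify directly from the scraped tibble. Here’s a minimal sketch (using the GINA.resolutions data built above) that computes the year-over-year change for each resolution type and keeps the 2017 rows:

GINA.resolutions %>%
  filter(metric != "Receipts", metric != "Resolutions") %>%
  arrange(metric, year) %>%
  group_by(metric) %>%
  mutate(`pct change` = round(100 * (value - lag(value)) / lag(value), 1)) %>% # change vs. prior year
  ungroup() %>%
  filter(year == 2017)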

The resolution type that jumped most markedly in 2017 was “unsuccessful conciliation.” A conciliation is where the EEOC “attempt[s] to achieve a just resolution of all violations found and to obtain agreement that the respondent will eliminate the unlawful employment practice and provide appropriate affirmative relief.” 29 C.F.R. § 1601.24. It’s unclear why this jump occurred from the summary statistics provided by the EEOC.
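One number the summary table does support is a conciliation success rate by year, which helps put the jump in context. A quick sketch using the same tibble:

GINA.resolutions %>%
  filter(metric %in% c("Successful Conciliations", "Unsuccessful Conciliations")) %>%
  spread(metric, value) %>% # one row per year, one column per conciliation outcome
  mutate(
    `success rate` = round(100 * `Successful Conciliations` /
      (`Successful Conciliations` + `Unsuccessful Conciliations`), 1)
  )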

Finally, I thought it was useful to plot all the resolution types together to show relative numbers:

GINA.resolutions %>%
  filter(metric != "Receipts" & metric != "Resolutions" & metric != "Monetary Benefits (Millions)*") %>%
  ggplot(aes(year, value, color = metric)) +
  geom_line() +
  labs(
    title = "EEOC Enforcement of GINA Charges",
    subtitle = "Resolutions by Type, FY 2010 - FY 2017",
    caption = paste0("Source: ", url),
    x = "", y = ""
  ) +
#  scale_y_sqrt() +
  scale_color_brewer("Resolution Type", palette="Paired")

From this perspective, it does look like the long-term trend has been for the EEOC to dismiss a majority of GINA-related charges as unfounded (i.e., no reasonable cause). However, for the cases that do have merit, it appears the EEOC has reversed its initial preference for administrative closure.
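To put a number on that dismissal trend, we can compute each year’s “no reasonable cause” findings as a share of total resolutions (a quick sketch from the same tibble):

GINA.resolutions %>%
  filter(metric %in% c("No Reasonable Cause", "Resolutions")) %>%
  spread(metric, value) %>%
  mutate(`share dismissed` = round(100 * `No Reasonable Cause` / Resolutions, 1)) # percent of resolutions closed as unfounded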

Conclusion

In all, I didn’t find the opposing lawyer’s argument particularly compelling in light of the data from President Trump’s first year in office. However, the first month of 2017 was President Obama’s last in office, and there was a flurry of activity by many regulatory agencies. It wouldn’t surprise me if the EEOC also participated in a high volume of lame-duck activity, and a lot of activity in January 2017 could have skewed the annual results. Monthly statistics would be nice, but they didn’t appear to be readily available. The goal with any R project is for it to be repeatable with additional data, so it will be interesting to see what the data from FY 2018 shows.
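One tweak would help with that repeatability: the gather(year, value, 2:9) call above hardcodes the eight FY columns, so the FY 2018 column would be silently dropped when the EEOC adds it. Gathering everything except metric avoids the problem; the cleaning step would become:

GINA.resolutions <- GINA.resolutions %>%
  filter(! metric == "") %>%
  filter(! metric == "Resolutions By Type") %>%
  gather(year, value, -metric) %>% # every column except metric, so new fiscal years are picked up automatically
  mutate(
    year = as.integer(year),
    value = as.numeric(gsub("[\\$\\%]", "", value))
  ) %>%
  as_tibble()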

This wasn’t a particularly complicated coding project; in fact, this writeup took me longer to produce than writing the code and reaching conclusions about whether GINA is on its last legs. Even so, I thought it was a good example of how data science can inform solutions to simple as well as complex problems.

Exercise dashboard

I posted a while back about using joy plots for heart rate data. Over the past couple of months, I grew tired of opening RStudio every time I wanted to look through my fitness tracker data, so I decided to create a shiny dashboard that I can load in a web browser. This involved setting up a shiny server in a dockerized container that runs an Rmd file. I’ve had issues getting plots to size themselves appropriately using base shiny, so I used the flexdashboard package to create a dashboard that will automatically resize.

I wanted to be able to look at individual workouts as well as weekly statistics. To remain consistent with the heart rate app I use, MotiFit, I copied its charting style for individual workouts. The screenshot at right shows how the app displays heart rate. I find this view useful during weightlifting sessions because I can tell when I’ve rested sufficiently to start another set.

The dashboard has a date selector so you can use small multiples to compare several workouts at once. The last workout in the multiples corresponds to the screenshot from the app, and the similarity is apparent.

The weekly view shows the aggregate amount of time spent in each heart rate zone. You can see a big surge in the number of hours of exercise in September 2017, which is when I started playing in a tennis league. There are noticeable gaps in December and January, when I got sick and then suffered a couple of injuries. I’ve slowly added more hours back as I’ve rehabbed the injuries.

There are a couple of other views, but I’m still trying to decide how to display some of the concepts effectively. The code for the flexdashboard is below.

---
title: "Exercise Dashboard"
runtime: shiny
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
---

```{r setup, include=FALSE}

library(lubridate)
library(readr)
library(tidyverse)
library(plotly)
library(flexdashboard)
library(ggridges)
library(data.table)
library(scales)

# Use the dockerized server's data directory if it exists; otherwise fall back to the local one
datadir <- if (dir.exists("/shiny-server/data/dir")) {
  "/shiny-server/data/dir"
} else {
  "/local/data/dir"
}
```

```{r load HR data}

out.file <- data.frame(timestamp = as.POSIXct(character()), bpm = integer(), stringsAsFactors = FALSE)
file.names <- dir(datadir, pattern = "^.*HeartRateData.*.csv")

library(doParallel)
registerDoParallel()

dir.traversal <- system.time({
  out.file <- foreach(file.name = file.names, .combine = "rbind") %dopar% {
    # Exported files are in at least 2 different encodings, so we guess using the
    # guess_encoding function from the readr package. We can't use readr's read_csv
    # or data.table's fread because some of the encodings are UTF-16LE, which makes
    # this process slow.
    encoding <- guess_encoding(paste(datadir, "/", file.name, sep = ""), n_max = 1000)
    file <- read.csv(
      paste(datadir, "/", file.name, sep = ""),
      skip = 1,
      sep = ",",
      strip.white = TRUE,
      stringsAsFactors = FALSE,
      fileEncoding = toString(encoding[1, 1])
    )

    # Exported data has 2 or 3 columns; if it's 3, join the date and time fields
    if (length(file) > 2) {
      temp.date <- file[, 1]
      temp.time <- trimws(file[, 2])
      temp.bpm <- trimws(file[, 3])

      # date format is Wed Feb 8 2017
      temp.datetime <- as.POSIXct(paste(temp.date, temp.time, sep = " "), format = "%a %b %d %Y %H:%M")

      temp.df <- data.frame(temp.datetime, temp.bpm, stringsAsFactors = FALSE)
      names(temp.df) <- c("timestamp", "bpm")
      file <- temp.df
      rm(temp.df)
    } else {
      names(file) <- c("timestamp", "bpm")
      file$timestamp <- as.POSIXct(file$timestamp)
    }

    file$name <- file.name

    file # returned to foreach, which rbinds the per-file results
  }
})

```

```{r add week number }
# Used to calculate week number below
start.date <- as.Date(as.character(min(out.file$timestamp)))

week.starts <- seq(from = start.date, to = as.Date(Sys.Date()), by = 7)
week.starts <- data.frame(
  week = 0:(NROW(week.starts) - 1),
  `week of` = as.Date(week.starts) # becomes week.of via check.names
)

HR <- out.file %>%
  mutate(
    week = as.numeric(as.Date(timestamp, tz = Sys.timezone()) - start.date) %/% 7,
    bpm = as.numeric(bpm)
  ) %>%
  left_join(week.starts, by = c("week" = "week")) %>%
  filter(! is.na(week)) %>%
  arrange(week)

# Reorder the factor so the oldest week is at the top of the ridgeline plot
HR$week.of <- factor(HR$week.of)
HR$week.of <- factor(HR$week.of, ordered = TRUE, levels = rev(unique(HR$week.of)))

```

```{r set HR zones}

# Heart rate zone boundaries and labels, matching the MotiFit app
cuts <- c(-Inf, 109, 123, 138, 164, Inf)
labs <- c("Warm Up", "Fitness", "Endurance", "Hardcore", "Red Line")
HR$zone <- cut(HR$bpm, breaks = cuts, labels = labs, include.lowest = TRUE, ordered_result = TRUE)
fitness.rainbow <- c("royalblue", "royalblue", "green", "yellow", "orange", "red")
rects <- data.frame(ystart = cuts[1:5], yend = cuts[2:6], zone = factor(labs, levels = rev(labs), ordered = TRUE))

bpm.min <- reactive({
  min(HR.filtered()$bpm, na.rm = TRUE)
})
bpm.max <- reactive({
  max(HR.filtered()$bpm, na.rm = TRUE)
})

# Color breaks at the midpoint of each zone so the gradient lines up with the boundaries
zone.breaks <- reactive({
  c(
    bpm.min(),
    (bpm.min() + cuts[2]) / 2,
    (cuts[2] + cuts[3]) / 2,
    (cuts[3] + cuts[4]) / 2,
    (cuts[4] + cuts[5]) / 2,
    bpm.max()
  )
})

```

```{r make HR reactive }

# Subset the heart rate data to the date range selected in the sidebar
HR.filtered <- reactive({
  HR %>%
    filter(as.Date(timestamp, tz = Sys.timezone()) >= input$date[1]) %>%
    filter(as.Date(timestamp, tz = Sys.timezone()) <= input$date[2])
})

```

```{r calc durations }

# Per-workout summary stats for the weekly date range
HR.duration <- reactive({
  HR.filtered() %>%
    mutate(date = as.Date(format(timestamp, "%Y-%m-%d"))) %>%
    group_by(name, date) %>% # name is the file name, so two files from the same date don't skew calculations
    summarize(
      start = min(timestamp, na.rm = TRUE),
      end = max(timestamp, na.rm = TRUE),
      `duration (m)` = round(as.numeric(difftime(end, start, units = "mins")), 1),
      `average bpm` = round(mean(as.numeric(bpm)), 0),
      sd = round(sd(as.numeric(bpm)), 1),
      max.bpm = max(as.numeric(bpm), na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      start = format(start, "%H:%M:%S"),
      end = format(end, "%H:%M:%S")
    )
})

# Same per-workout stats, but over the daily date range
HR.daily.duration <- reactive({
  HR %>%
    filter(as.Date(timestamp, tz = Sys.timezone()) >= input$dailyDate[1]) %>%
    filter(as.Date(timestamp, tz = Sys.timezone()) <= input$dailyDate[2]) %>%
    mutate(date = as.Date(format(timestamp, "%Y-%m-%d"))) %>%
    group_by(name, date) %>%
    summarize(
      start = min(timestamp, na.rm = TRUE),
      end = max(timestamp, na.rm = TRUE),
      `duration (m)` = round(as.numeric(difftime(end, start, units = "mins")), 1),
      `average bpm` = round(mean(as.numeric(bpm)), 0),
      sd = round(sd(as.numeric(bpm)), 1),
      max.bpm = max(as.numeric(bpm), na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      start = format(start, "%H:%M:%S"),
      end = format(end, "%H:%M:%S")
    )
})

# Time in each heart rate zone per week; n() counts readings, roughly one per second
HR.zone.duration.weekly <- reactive({
  HR.filtered() %>%
    group_by(week, zone) %>%
    summarize(
      `duration (s)` = n(),
      `duration (m)` = round(n() / 60, 1),
      `duration (h)` = round(n() / 3600, 1)
    ) %>%
    ungroup() %>%
    left_join(week.starts, by = "week") %>%
    filter(! is.na(week)) %>%
    arrange(week)
})

# Total hours and session count per week
HR.weekly.duration <- reactive({
  HR.zone.duration.weekly() %>%
    group_by(week, week.of) %>%
    summarize(
      `duration (h)` = sum(`duration (h)`),
      `exercise sessions` = n()
    ) %>%
    ungroup() %>%
    arrange(week)
})

# Time in each zone per day, over the daily date range
HR.zone.duration.daily <- reactive({
  HR %>%
    mutate(date = as.Date(timestamp, tz = Sys.timezone())) %>%
    filter(date >= input$dailyDate[1]) %>%
    filter(date <= input$dailyDate[2]) %>%
    group_by(date, zone) %>%
    summarize(
      `duration (s)` = n(),
      `duration (m)` = round(n() / 60, 1),
      `duration (h)` = round(n() / 3600, 1)
    ) %>%
    ungroup()
})

```

Weekly
=======================================================================

Inputs {.sidebar}
-----------------------------------------------------------------------

Enter a date range for the weekly charts:

```{r input date range for weekly chart }

dateRangeInput("date", "Date Range", start = as.Date(min(HR$timestamp, na.rm = T), tz = Sys.timezone()))

```

Column
-----------------------------------------------------------------------

### Weekly Exercise

```{r weekly barplot}

renderPlotly({
  ggplot(data = HR.zone.duration.weekly(), aes(x = `week.of`, y = `duration (h)`, fill = zone)) +
    geom_bar(
      stat = "identity",
      position = "stack"
    ) +
    scale_fill_manual(values = fitness.rainbow[2:6]) +
    scale_y_continuous(breaks = seq(from = 0, to = 8, by = 2), minor_breaks = seq(from = 0, to = 8, by = 1))
})

```

Column
-----------------------------------------------------------------------

### Heartrate Zones

```{r ridgeline plot}

renderPlot({
  ggplot(
    HR.filtered(),
    aes(x = bpm, y = `week.of`, fill = ..x..)
  ) +
    scale_fill_gradientn(
      colors = fitness.rainbow,
      breaks = zone.breaks()
    ) +
    scale_x_continuous(breaks = function(x) { seq(from = 0, to = max(x), by = 10) }) +
    geom_density_ridges_gradient(na.rm = TRUE, col = "grey70", scale = 4) +
    theme_ridges(font_size = 7) +
    theme(
      legend.position = "none"
    )
})

```

### Weekly Durations

```{r total duration bar plot}

renderPlotly({
  ggplot(data = HR.weekly.duration(), aes(x = `week.of`, y = `duration (h)`)) +
    geom_bar(
      aes(
        text = paste(
          "week: ", week, "<br>",
          "# of sessions: ", `exercise sessions`,
          sep = ""
        )
      ),
      stat = "identity"
    ) +
    geom_smooth(span = 0.35)
})

```

Daily {data-orientation=rows}
=======================================================================

Inputs {.sidebar}
-----------------------------------------------------------------------

Enter a date range for the daily charts:

```{r input date range for daily calcs }

dateRangeInput("dailyDate", "Date Range", start = Sys.Date() - 15)

```

Column
-----------------------------------------------------------------------

### Daily Heartrate Zones

```{r individual exercise zones}

renderPlotly({
  ggplot(data = HR.zone.duration.daily(), aes(x = date, y = `duration (m)`, fill = zone)) +
    geom_bar(
      stat = "identity",
      position = "dodge"
    ) +
    scale_fill_manual(values = fitness.rainbow[2:6]) +
    scale_y_continuous(breaks = seq(from = 0, to = max(HR.zone.duration.daily()$`duration (m)`, na.rm = TRUE), by = 15)) +
    theme(
      legend.position = "none"
    )
})

```

### Daily Workouts, BPM vs. Duration

```{r bubble plot}

renderPlotly({
  ggplot(HR.daily.duration(), aes(x = `duration (m)`, y = `average bpm`)) +
    geom_point(aes(color = `average bpm` + sd, size = -sd), alpha = 0.5) +
    scale_color_gradientn(
      colors = fitness.rainbow[2:6], # "royalblue" "green" "yellow" "orange" "red"
      values = rescale(zone.breaks(), to = c(0, 1)),
      na.value = "black",
      breaks = cuts, # -Inf 109 123 138 164 Inf
      limits = c(min(zone.breaks()), max(zone.breaks()))
    ) +
    scale_x_continuous(
      limits = c(0, as.numeric(max(HR.daily.duration()$`duration (m)`))),
      breaks = seq(from = 0, to = as.numeric(max(HR.daily.duration()$`duration (m)`)), by = 15)
    ) +
    scale_y_continuous(
      limits = c(min(zone.breaks()), max(zone.breaks())),
      breaks = cuts[2:5],
      minor_breaks = NULL
    ) +
    theme(
      legend.position = "none"
    )
})

```

Column
-----------------------------------------------------------------------

### Heartrate Curves

```{r individual exercise plots}

renderPlot({
  temp <- HR.filtered() %>%
    mutate(date = as.Date(timestamp, tz = Sys.timezone())) %>%
    filter(as.Date(timestamp, tz = Sys.timezone()) >= input$dailyDate[1]) %>%
    filter(as.Date(timestamp, tz = Sys.timezone()) <= input$dailyDate[2])

  # Zone bands in the background, heart rate trace as a ribbon on top
  ggplot() +
    geom_rect(data = rects, aes(ymin = ystart, ymax = yend, fill = zone), xmin = -Inf, xmax = Inf, inherit.aes = FALSE) +
    geom_ribbon(data = temp, aes(x = timestamp, ymax = bpm, ymin = min(temp$bpm)), color = "white", fill = "grey90", alpha = .5) +
    scale_fill_manual(values = rev(fitness.rainbow[2:6])) +
    theme_minimal() +
    facet_wrap(~ name, scales = "free_x") +
    scale_x_datetime(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0), breaks = cuts[2:5]) +
    theme(panel.grid = element_blank(), panel.border = element_blank())
})

```

The “joy” of plotting heartrate data

There have been quite a few posts lately in the R world about the joyplot. I emailed my dataviz instructor from last fall this wonderful post showing how Congress has become more polarized on one side of the aisle. He pointed out that the technique isn’t as new as everyone thinks: Tufte gives an example of it in one of his books, a chart dating from the early 1900s (the image is from slide 71 of this presentation).

In any event, joyplots are great for comparing values over time. I’ve been wearing a heartrate monitor at the gym since February, and the MotiFit app lets you export that data to a CSV file. I’ve been saving these files each day after working out, but I hadn’t done much with them yet. After gathering 6 months of data, it seemed high time to play around with joyplots!

The first thing I needed to do was combine all the files, which wasn’t as straightforward as I hoped. The MotiFit app exported data in two different formats and two different encodings, so I had to account for this. I wound up guessing the encodings with readr’s guess_encoding function and then combining the date and time columns where necessary.

out.file <- data.frame()
file.names <- dir(datadir, pattern = "HeartRateData.+?.csv") # You'll need to set datadir

library(readr)
for (i in 1:length(file.names)){
  # exported files are in at least 2 different encodings, so we're going to guess using the guess_encoding function from the readr package 
  encoding <- guess_encoding(paste(datadir, "/", file.names[i], sep=""), n_max = 1000)
  file <- read.csv(
    paste(datadir, "/", file.names[i], sep=""), 
    skip = 1,
    stringsAsFactors = FALSE, 
    fileEncoding = toString(encoding[1,1])
  )
  
  # exported data has 2 or three columns, so if it's three, we're going to join the date and time fields
  if (length(file) > 2) {
    temp.date <- file[,1]
    temp.time <- trimws(file[,2])
    temp.bpm <- trimws(file[,3])
    
    # date format is Wed Feb 8 2017
    temp.datetime <- as.POSIXct(paste(temp.date, temp.time, sep=" "), format = "%a %b %d %Y %H:%M")
    
    temp.df <- data.frame(temp.datetime, temp.bpm)
    names(temp.df) <- c("timestamp", "bpm")
    file <- temp.df
    rm(temp.df)
  } else {
    names(file) <- c("timestamp", "bpm")
    file$timestamp <- as.POSIXct(file$timestamp)
  }
  
  out.file <- rbind(out.file, file)
}

Next, I added a week number column. I’ve been lifting since March on the Wendler 5/3/1 program, which has 4-week cycles of varying intensity (including a deload week). So, the intensity of exercise depends primarily on which week it is.

library(dplyr)

# Used to calculate week number below
start.date <- min(as.Date(out.file$timestamp), na.rm=TRUE)

HR <- out.file %>%
  mutate(
    week = factor(as.numeric(as.Date(timestamp) - start.date) %/% 7, ordered = F),
    bpm = as.numeric(bpm)
  ) %>%
  filter(! is.na(week)) %>%
  arrange(week)

# Reorder factor so oldest week is at top
HR$week <- factor(HR$week, ordered = TRUE, levels = rev(unique(HR$week)))


Now that we have the data we need, we can apply ggjoy! I’m using a custom color scale based on the zone figures the MotiFit app gives me. To make the colors look right, however, you need to set the breaks at the midpoint between each zone boundary (something I learned in the BMI post).

Here’s the result:

library(ggjoy)

bpm.min <- min(HR$bpm, na.rm = T)
bpm.max <- max(HR$bpm, na.rm = T)

breaks <- c(
  bpm.min,
  (bpm.min + 109) / 2,
  (109 + 123) / 2, 
  (123 + 138) / 2, 
  (138 + 164) / 2,
  bpm.max
)

ggplot(
  HR,
  aes(x = bpm, y = week, height = ..density.., fill = ..x..)
) +
  scale_fill_gradientn(
    colors = c("royalblue", "royalblue", "green", "yellow", "orange", "red"),
    breaks = breaks
  ) +
  geom_joy_gradient(na.rm = TRUE, col = "grey70", scale = 4) +
  theme_joy(font_size = 10) +
  theme(
    legend.position = "none"
  )

The thing that jumps out at me is that you can really identify which weeks I was good about doing cardio work at a target heart rate around 140, and which weeks I wasn’t.
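If you want to quantify that impression instead of eyeballing it, a rough sketch: count minutes per week with readings near the 140 bpm target. This assumes, like the duration math in the dashboard above, roughly one bpm reading per second, and the 135-145 band is an arbitrary window around the target.

HR %>%
  filter(bpm >= 135, bpm <= 145) %>% # readings near the 140 bpm cardio target
  group_by(week) %>%
  summarize(`cardio minutes` = round(n() / 60, 1)) # ~1 reading per second, so 60 readings is about a minute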