Consecutive clustering

Clustering consecutive values in a single dimension to reveal bouts of activity

r
r shiny
clustering
collab

My friend was telling me about a data bottleneck that was stalling progress on a specific project. The project involved classifying bouts of inactivity from actigraphy measures (y, activity level) over time (x, 5-minute bins). The whole process could take a week of concerted effort and relied on trained observers applying these rules to identify bouts of inactivity:

  1. Inactivity is defined by a threshold ratio (usually 10%) of the maximum activity recorded during the session for each individual.

  2. A sudden dip below threshold does not count as inactivity. There need to be at least two consecutive below-threshold time bins (10 minutes when using 5-minute bins) to define a bout of inactivity.

  3. The first portion of recorded time is not a bout of inactivity; it is the acclimation phase before measurement begins.

Here are the rules in action:
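
As a quick hand-worked illustration (my own toy numbers, not the project's data), take one individual's twelve 5-minute bins with a maximum activity of 100, so the 10% threshold is 10:

##toy example: twelve 5-minute bins for one individual (hypothetical values)
activity  <- c(0, 0, 55, 80, 100, 7, 60, 5, 4, 6, 90, 75)
threshold <- max(activity) * 0.1    #rule 1: 10% of the session maximum = 10

##rule 3: bins 1-2 (the opening run of zeros) are the acclimation phase, not a bout
##rule 2: bin 6 is a lone dip below threshold, so it is discarded
##result: bins 8-10 form the only bout of inactivity (15 minutes)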

Take 1: single-dimension clustering

After hearing these rules, I knew they could be written into an algorithm. I began by searching Stack Overflow for information on clustering along a single dimension (time). At the time, I could not find a function for that use case, so I wrote my own. The function relied on nested for loops and extracted the specified information in ~30 seconds (for 7 different individuals). I wrapped the function into my first R Shiny app for my friend to use. Because of the processing speed, the app only offered basic interactivity (uploading and downloading data). Further steps toward interactivity, like threshold setting, were appealing but seemed out of reach with the original algorithm.
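
For context, a loop-based approach in the spirit of that first attempt might look something like the sketch below. This is my own reconstruction, not the original app code; the function name find_bouts and its arguments are made up for illustration. It scans one individual's series, tracks runs of below-threshold bins, and keeps runs of two or more bins that do not start the recording.

##reconstruction of a loop-based run finder (illustrative; not the original app code)
find_bouts <- function(time, value, ratio = 0.1) {
  threshold <- max(value) * ratio                      ##rule 1: threshold as a ratio of the max
  below <- value <= threshold
  bouts <- list()
  start <- NA
  for (i in seq_along(below)) {
    if (below[i] && is.na(start)) start <- i           ##open a candidate run
    run_ends <- !is.na(start) && (!below[i] || i == length(below))
    if (run_ends) {
      end  <- if (below[i]) i else i - 1
      keep <- (end - start + 1) >= 2 && start > 1      ##rules 2 and 3
      if (keep) bouts[[length(bouts) + 1]] <- time[start:end]
      start <- NA
    }
  }
  bouts
}

Applied per individual (hence the nested loops), this works, but it is exactly the kind of bookkeeping that consecutive_id replaces in Take 2.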

Take 2: incorporating consecutive_id and tidy code

While learning and relearning data approaches with R for Data Science (2e), I decided to revisit the clustering problem. I was inspired after reading about logical vectors (Ch. 13), numeric vectors (Ch. 14), and the function dplyr::consecutive_id. The function matched what I had been trying to achieve earlier and had the added benefit of being part of the tidyverse. I rewrote the algorithm in far less code than before, and it returned the same information (on the same data) in 0.1 seconds. That’s a time improvement of about 99.5% over the first attempt! The updated code behind the definition of inactivity bouts follows a quick look at consecutive_id itself.
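
First, a minimal demonstration of what consecutive_id does (my own toy vector, not the project data): it assigns a new id every time the value changes, so consecutive runs of identical values share an id.

##minimal demo of dplyr::consecutive_id on a toy vector
library(dplyr)
consecutive_id(c(0, 0, 1, 1, 1, 0, 2, 2))
##returns: 1 1 2 2 2 3 4 4

Grouping on those ids, or on the ids of a thresholded 0/1 category, is what turns runs of consecutive bins into clusters in the chunk below.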

Code
##packages (the chunk assumes the tidyverse is installed)
library(tidyverse)

##sample data with the appearance of input data
load(file = "data/ex_data.rdata")

##clean the data: pivot every column except time into name/value pairs
piv_data <- pivot_longer(
  ex_data,
  cols = -1,
  names_to = "name",
  values_to = "value",
  cols_vary = "fastest"
)

names(piv_data) <- c("time", "name", "value")

##derive each individual's threshold (10% of their maximum activity)
piv_data <- piv_data |> 
  group_by(name) |> 
  mutate(
    max = max(value),
    threshold = max * 0.1
  )

##clustering algorithm
piv_data_clust <- piv_data |> 
  group_by(name) |> 
  mutate(id = consecutive_id(value)) |> 
  filter(id != 1) |>                               #(3) remove the first set of zeros
  mutate(
    categ = if_else(value > threshold, 0, 1),      #(1) use threshold to categorize vals
    clust_id = consecutive_id(categ)               #set id for consecutive vals
  ) |> 
  group_by(name, categ, clust_id) |> 
  mutate(
    n = n()                                        #get group size
  ) |> 
  group_by(name, clust_id) |> 
  filter(categ == 1 & n > 1) |>                    #(2) filter for clusters of size > 1
  select(!c(id, categ))

glimpse(piv_data_clust)
Rows: 65
Columns: 7
Groups: name, clust_id [27]
$ time      <int> 4, 5, 5, 6, 6, 7, 8, 20, 21, 22, 27, 28, 29, 44, 45, 60, 61,…
$ name      <chr> "obs_3", "obs_2", "obs_3", "obs_2", "obs_3", "obs_3", "obs_3…
$ value     <int> 17, 11, 17, 10, 13, 5, 1, 1, 16, 4, 9, 10, 17, 5, 19, 10, 14…
$ max       <int> 196, 198, 196, 198, 196, 196, 196, 196, 196, 196, 198, 198, …
$ threshold <dbl> 19.6, 19.8, 19.6, 19.8, 19.6, 19.6, 19.6, 19.6, 19.6, 19.6, …
$ clust_id  <int> 3, 2, 3, 2, 3, 3, 3, 11, 11, 11, 14, 14, 14, 8, 8, 16, 16, 2…
$ n         <int> 5, 2, 5, 2, 5, 5, 5, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, …

In the code chunk, I used an example input to clean and cluster the data. Glimpsing the output shows the computed columns noted in the code comments, with clust_id and n as the key values for identifying and sizing clusters in the data. Because the computation is much quicker, user-defined thresholds and interactive plots are easier to implement, with updates closer to real time.

User-defined inactivity clusters with R Shiny

The R Shiny app for clustering actigraphy data uses example data (generated by random sampling) to showcase interactive threshold values, plots, and tabular data. The app also accepts .csv input: the first column should contain the time values (x), and each remaining named column should contain one individual's activity data. The app's interactivity supports up to 10 individuals (11 total columns: 1 for time and 10 for observations).
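
A minimal sketch of how that interactivity could be wired up is below. This is my own simplification with assumed input names ("file", "ratio", "clusters"), not the published app's source: a file upload, a threshold slider, and the same consecutive_id pipeline recomputed whenever either changes.

##minimal sketch of upload + threshold interactivity (assumed structure, not the app's code)
library(shiny)
library(tidyverse)

ui <- fluidPage(
  fileInput("file", "Upload .csv (column 1 = time, columns 2+ = one individual each)"),
  sliderInput("ratio", "Threshold ratio", min = 0.05, max = 0.5, value = 0.1, step = 0.05),
  tableOutput("clusters")
)

server <- function(input, output, session) {
  output$clusters <- renderTable({
    req(input$file)
    read_csv(input$file$datapath, show_col_types = FALSE) |>
      rename(time = 1) |>
      pivot_longer(cols = -time, names_to = "name", values_to = "value") |>
      group_by(name) |>
      mutate(
        threshold = max(value) * input$ratio,      #user-defined threshold ratio
        id = consecutive_id(value)
      ) |>
      filter(id != 1) |>                           #drop the acclimation run
      mutate(
        categ = if_else(value > threshold, 0, 1),
        clust_id = consecutive_id(categ)
      ) |>
      group_by(name, categ, clust_id) |>
      mutate(n = n()) |>
      filter(categ == 1 & n > 1) |>                #keep bouts of 2+ bins
      ungroup()
  })
}

shinyApp(ui, server)

Because the clustering itself now runs in roughly a tenth of a second, recomputing on every slider change is practical.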

Within the app and in the figure, the plots reveal an (unintended) aspect of the data: the longer line segments connecting below-threshold points to high-activity points suggest exploring leading and lagging values with dplyr::lag and dplyr::lead. From a biological perspective, these values might be interpreted as relative “hyperlocomotion” entering or exiting a bout of inactivity. This type of exploratory data analysis applies to other metrics over time, like prices or internet activity.
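
As a rough sketch of that exploration (my own illustration reusing the piv_data object from the earlier chunk; not part of the app), dplyr::lag and dplyr::lead can flag large swings into or out of below-threshold bins:

##flag large swings into or out of below-threshold bins (exploratory sketch)
piv_data |>
  group_by(name) |>
  arrange(time, .by_group = TRUE) |>
  mutate(
    drop_in  = lag(value) - value,     #fall in activity entering this bin
    rise_out = lead(value) - value     #rise in activity leaving this bin
  ) |>
  filter(value <= threshold & (drop_in > threshold | rise_out > threshold))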

In this post, I shared how my clustering algorithm evolved and how I improved my first R Shiny app with tidy coding principles. IMO, revisiting problems is a great way to track personal improvement, especially over periods of skill growth. A fun part of making apps to share coding solutions is the relief and excitement felt by colleagues. So far, this app has officially inspired at least one person to learn R!


Session Info
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.3 (2023-03-15 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2024-07-31
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
 quarto   1.4.550 @ C:\\Users\\barne\\AppData\\Local\\Programs\\Quarto\\bin\\quarto.exe

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.2.3)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.3)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.2.3)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.3)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.3)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.3)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.3)

 [1] C:/Users/barne/AppData/Local/R/win-library/4.2
 [2] C:/Program Files/R/R-4.2.3/library

──────────────────────────────────────────────────────────────────────────────