
feat(scripts): add automated ODD memory usage analysis #16847

Open · wants to merge 1 commit into base: edge

Conversation

@mjhuff (Contributor) commented Nov 15, 2024

Closes EXEC-828

Overview

This PR adds a script and an action for performing ODD memory usage analysis in an automated manner.

Recently, we added a resource monitor to the shell layer on the ODD. The monitor scrapes various data from preselected processes and the ODD in general, pushing this data to Mixpanel.

While the data is useful and can be analyzed, there are really two problems with it:

  • Post-processing the data in Mixpanel itself is tricky, and directly exporting the data for analysis via a spreadsheet, etc., is not straightforward (the script solves this one).
  • It's not automatic (the action solves this one).

What Results Indicate (And Don't Indicate)

This script answers "is memory consumption increasing as uptime increases for a monitored ODD process?" and "is ODD available memory decreasing as uptime increases?", both of which point to broad, general-usage memory leaks. Note that it is definitely possible for memory leaks to occur in short-lived function calls, views, etc., and those leaks would likely not be captured by this analysis.

How the script works

The script is the interesting part of this PR. On a high level, it:

  • Determine the current, valid build of the app by looking at the app manifest, and also grab some number of past, valid builds relative to the current build (the hardcoded number is 2 here). Example: chore_release-8.2.0 or 8.1.0.
  • Using those build strings, query Mixpanel for the past month of resource data for those builds. Each build gets a separate query; while we could do this in one query, Mixpanel allows us more data when we use separate queries.
  • Parse the data. We're interested in things like system uptime, process names, process memory usage, and general ODD available memory.
  • Do statistics. After some reading, Pearson's correlation coefficient seems like the correct thing to use here. Note that the interpretation uses different cutoffs for weak/moderate/strong trends than this specific source, because there seems to be some disagreement about the cutoffs, but I don't think it matters that much in the grand scheme of things. There are some very minor, semi-subjective magic numbers to consider, like p-values and minimum sample sizes, but again, anything reasonable is probably fine.
  • Do math. Because it's a bit easier to understand at first glance, the script puts memory consumption/availability averages into uptime buckets on a per-process basis. Note that the script's implementation of Pearson's does NOT compute correlations over the buckets of averages; we actually correlate each raw memory usage data point against uptime. (See the sketch after this list.)
  • When doing statistics and math, we aggregate certain processes (AGGREGATED_PROCESSES), since, for example, electron's renderer process can include a lot of unique flags that create unnecessarily disparate results when taken individually. Similarly, there are a lot of processes we don't care to analyze, such as python3 processes, so these are added to BLACKLISTED_PROCESSES and not included in the final results.
  • We report these results.
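
Roughly, the statistics and math steps look like the sketch below. The identifiers and data shape are simplified for illustration, and the p-value and minimum-sample-size checks are omitted; the bucket boundaries match the sample output further down.

```javascript
// Illustrative sketch only; real identifiers differ, and p-value /
// minimum-sample-size checks are omitted.
const UPTIME_BUCKET_FLOORS_HRS = [0, 20, 40, 60, 80, 120, 240] // last is 240+

// Pearson's r over the raw (uptimeHrs, memoryMb) pairs: one point per
// sample, NOT one point per bucket average.
function pearson(xs, ys) {
  const mean = arr => arr.reduce((acc, v) => acc + v, 0) / arr.length
  const mx = mean(xs)
  const my = mean(ys)
  let num = 0
  let dx2 = 0
  let dy2 = 0
  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - mx
    const dy = ys[i] - my
    num += dx * dy
    dx2 += dx * dx
    dy2 += dy * dy
  }
  return num / Math.sqrt(dx2 * dy2)
}

// Bucketed averages are reported separately, purely for readability; they
// play no part in the correlation above.
function bucketAverages(samples /* [{ uptimeHrs, memoryMb }] */) {
  const buckets = UPTIME_BUCKET_FLOORS_HRS.map(() => [])
  for (const { uptimeHrs, memoryMb } of samples) {
    const i = UPTIME_BUCKET_FLOORS_HRS.findLastIndex(floor => uptimeHrs >= floor)
    if (i >= 0) buckets[i].push(memoryMb)
  }
  return buckets.map(bucket =>
    bucket.length > 0
      ? (bucket.reduce((acc, v) => acc + v, 0) / bucket.length).toFixed(2)
      : 'N/A - Not enough data available.'
  )
}
```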

Some implementation notes

I decided against parameterizing a couple of things that could debatably be parameterized, notably the number of past valid builds to report on and the timeframe to analyze (we always look at the past month of available data), because I don't think it matters that much. Once a build is built, we don't expect to see any real change in its memory performance month-to-month, and if users really want to compare against previous builds with resource monitor data, it's probably better to look at previous CI runs than to rapidly query the Mixpanel API.

Sample output

ODD Available Memory and Processes with Increasing Memory Trend by Version (Rolling 1 Month Analysis Window):

8.2.0-alpha.2: "N/A - No data available"

8.2.0-alpha.1: {
  "robot-server-uvicorn-processes": {
    "correlation": "0.2898",
    "sampleSize": 499,
    "interpretation": "Weak positive correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "187.94",
      "20-40hrs": "199.77",
      "40-60hrs": "N/A - Not enough data available.",
      "60-80hrs": "N/A - Not enough data available.",
      "80-120hrs": "N/A - Not enough data available.",
      "120-240hrs": "N/A - Not enough data available.",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "app-renderer-processes": {
    "correlation": "0.1756",
    "sampleSize": 561,
    "interpretation": "Weak positive correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "136.10",
      "20-40hrs": "151.55",
      "40-60hrs": "N/A - Not enough data available.",
      "60-80hrs": "N/A - Not enough data available.",
      "80-120hrs": "N/A - Not enough data available.",
      "120-240hrs": "N/A - Not enough data available.",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "odd-available-memory": {
    "correlation": "-0.2863",
    "sampleSize": 513,
    "interpretation": "Weak negative correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "1127.02",
      "20-40hrs": "1108.41",
      "40-60hrs": "N/A - Not enough data available.",
      "60-80hrs": "N/A - Not enough data available.",
      "80-120hrs": "N/A - Not enough data available.",
      "120-240hrs": "N/A - Not enough data available.",
      "240+hrs": "N/A - Not enough data available."
    }
  }
}

8.2.0-alpha.0: {
  "robot-server-uvicorn-processes": {
    "correlation": "0.1196",
    "sampleSize": 1643,
    "interpretation": "Weak positive correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "198.24",
      "20-40hrs": "222.17",
      "40-60hrs": "231.81",
      "60-80hrs": "234.99",
      "80-120hrs": "223.96",
      "120-240hrs": "222.31",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "app-renderer-processes": {
    "correlation": "0.6037",
    "sampleSize": 1851,
    "interpretation": "Moderate positive correlation (>0.3 and <0.7)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "161.67",
      "20-40hrs": "183.64",
      "40-60hrs": "200.20",
      "60-80hrs": "249.22",
      "80-120hrs": "309.07",
      "120-240hrs": "555.83",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "odd-available-memory": {
    "correlation": "-0.7721",
    "sampleSize": 1725,
    "interpretation": "Strong negative correlation (>0.7)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "1083.79",
      "20-40hrs": "1028.03",
      "40-60hrs": "986.28",
      "60-80hrs": "916.17",
      "80-120hrs": "848.11",
      "120-240hrs": "613.95",
      "240+hrs": "N/A - Not enough data available."
    }
  }
}

Test Plan and Hands-on Testing

  • Ran the script manually a bunch of times. It does what it should do.
  • See review requests

Changelog

  • Added automated analysis of monitored Flex resource data.

Review requests

  • So it seems like the only real way to test a GitHub Action is by merging it? If so, does the .yaml look like it might work?
  • The script gets the app manifest by querying a hard-coded URL. Is there a preferable way to do this or are we ok doing that?
  • I decided not to make this a pass/fail action, since I think it's important to contextualize results, view them on a semi-regular cadence, and IMO, it's misleading to say "we don't have a memory leak until correlation X is reached." Since this action runs on a cronjob and isn't directly tied to a PR, I don't think pass/fail matters much anyway. All this being said, I'm more than happy to tie this to pass/fail if others feel differently.

Risk assessment

Low

@mjhuff mjhuff requested a review from sfoster1 November 15, 2024 15:09
@mjhuff mjhuff requested review from a team as code owners November 15, 2024 15:09
@mjhuff mjhuff requested review from jerader and a team and removed request for a team and jerader November 15, 2024 15:09
```yaml
on:
  schedule:
    - cron: '30 12 * * *'
  workflow_dispatch:
```
@mjhuff (Contributor, Author) commented:

I believe this line lets us manually run an action on-the-fly from the Actions tab? Is that true?

A member replied:

That's correct, yeah. But it may have to be merged to the default branch first so it actually shows up in the actions tab.

@sfoster1 (Member) left a comment:

I think this is complex enough that it should just be a javascript action with its own deps that lives in .github/actions/odd-memory-stats or something, but it does look great!

```yaml
          MIXPANEL_INGEST_SECRET: ${{ secrets.MIXPANEL_INGEST_SECRET }}
          MIXPANEL_PROJECT_ID: ${{ secrets.OT_APP_MIXPANEL_ID }}
        run: |
          OUTPUT=$(node ./scripts/resource-monitor/perform-memory-analysis "$MIXPANEL_INGEST_USER" "$MIXPANEL_INGEST_SECRET" "$MIXPANEL_PROJECT_ID")
```
A member commented:

these don't have to be passed through the environment; you can have this be $(node ./scripts/resource-monitor/perform-memory-analysis ${{ secrets.MIXPANEL_INGEST_USER }} for instance.

even better, and more on this later, is to just not put this through a shell action at all and implement this as a node action and use the api that it gives you, and then you don't have to deal with any of this at all
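
For illustration, a minimal sketch of what the suggested node-action entry point could look like using the @actions/core toolkit; the file path, input names, and the performMemoryAnalysis export are hypothetical:

```javascript
// .github/actions/odd-memory-stats/index.js (hypothetical path and names).
// Inputs arrive via the actions toolkit rather than a shell command line,
// so secrets never have to be interpolated into a run: step.
const core = require('@actions/core')
// Assumed export; the PR's script is invoked via CLI args today.
const { performMemoryAnalysis } = require('./perform-memory-analysis')

async function run() {
  try {
    const user = core.getInput('mixpanel-ingest-user', { required: true })
    const secret = core.getInput('mixpanel-ingest-secret', { required: true })
    const projectId = core.getInput('mixpanel-project-id', { required: true })
    const results = await performMemoryAnalysis(user, secret, projectId)
    core.setOutput('analysis', JSON.stringify(results))
  } catch (error) {
    core.setFailed(error.message)
  }
}

run()
```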

@@ -0,0 +1,84 @@

```javascript
// Analysis is based on a one-tailed Pearson's correlation coefficient.
```
A member commented:

if you implement this as a javascript action, you actually bundle locally from a separate deps list (see e.g. the helper action in oe-core; this is a rollup output plus source pair) and you can use a stats package if you want.
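
For instance, a minimal rollup.config.js along those lines; the exact plugin set the oe-core helper uses is an assumption here:

```javascript
// rollup.config.js: bundle the action and its deps into one dist/index.js
// so the runner needs no npm install. Plugin choice is an assumption.
import resolve from '@rollup/plugin-node-resolve'
import commonjs from '@rollup/plugin-commonjs'
import json from '@rollup/plugin-json'

export default {
  input: 'src/index.js',
  output: { file: 'dist/index.js', format: 'cjs' },
  plugins: [resolve({ preferBuiltins: true }), commonjs(), json()],
}
```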

```javascript
const formatDate = date => date.toISOString().split('T')[0]

const monthAgo = new Date(yesterday)
monthAgo.setMonth(yesterday.getMonth() - 1)
```
A member commented:

does this actually work with the web-API Date, instead of something like Arrow? Like, what happens if you do this on October 31, or I guess on November 1?
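
For reference, plain Date handles this by rolling over rather than clamping; a quick demonstration, using the UTC accessors so the printed date is deterministic:

```javascript
// Plain Date does not clamp when the target month is shorter: it rolls over.
const yesterday = new Date(Date.UTC(2024, 9, 31)) // October 31, 2024 (UTC)
const monthAgo = new Date(yesterday)
monthAgo.setUTCMonth(yesterday.getUTCMonth() - 1) // "September 31" rolls to October 1
console.log(monthAgo.toISOString().split('T')[0]) // "2024-10-01", not "2024-09-30"
```

So around short months the window start drifts by a day rather than erroring, which is probably tolerable for a rolling one-month analysis, but worth knowing.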

@@ -0,0 +1,52 @@

```javascript
const fetch = require('node-fetch')

const APP_MANIFEST = 'https://builds.opentrons.com/app/releases.json'
```
A member commented:

since this is ODD data, and ODD versions are actually robot versions, it might make sense to do this from https://builds.opentrons.com/ot3-oe/releases.json
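
A sketch of that swap; note the manifest's JSON shape isn't shown in this PR, so the key access below is a placeholder assumption:

```javascript
const fetch = require('node-fetch')

// Suggested robot/ODD manifest instead of the app manifest.
const ODD_MANIFEST = 'https://builds.opentrons.com/ot3-oe/releases.json'

async function getRecentBuilds(pastBuildCount = 2) {
  const response = await fetch(ODD_MANIFEST)
  if (!response.ok) {
    throw new Error(`Manifest fetch failed: ${response.status}`)
  }
  const manifest = await response.json()
  // Placeholder: assumes version strings appear as keys in the manifest.
  return Object.keys(manifest).slice(-(pastBuildCount + 1))
}
```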
