
feat(scripts): add automated ODD memory usage analysis #16847

Open · wants to merge 1 commit into base: edge

Conversation

@mjhuff (Contributor) commented Nov 15, 2024

Closes EXEC-828

Overview

This PR adds a script and an action for performing ODD memory usage analysis in an automated manner.

Recently, we added a resource monitor to the shell layer on the ODD. The monitor scrapes various data from preselected processes and the ODD in general, pushing this data to Mixpanel.

While the data is useful and can be analyzed, there are really two problems with it:

  • Post-processing the data in Mixpanel itself is tricky, and directly exporting the data for analysis via a spreadsheet, etc., is not straightforward (the script solves this one).
  • It's not automatic (the action solves this one).

What Results Indicate (And Don't Indicate)

This script answers "is memory consumption increasing as uptime increases for a monitored ODD process?" and "is ODD available memory decreasing as uptime increases?", both of which point to broad, general-usage memory leaks. Note that it is definitely possible for memory leaks to occur in short-lived function calls, views, etc., and those leaks would likely not be captured by this analysis.

How the script works

The script is the interesting part of this PR. On a high level, it:

  • Determine the current, valid build of the app by looking at the app manifest, and also grab some number of past, valid builds relative to the current build (the hardcoded number is 2 here). Example: chore_release-8.2.0 or 8.1.0.
  • Using those build strings, query Mixpanel for the past month of resource data for those builds. Each build gets a separate query; while we could do this in one query, Mixpanel allows us more data when we use separate queries.
  • Parse the data. We're interested in things like system uptime, process names, process memory usage, and general ODD available memory.
  • Do statistics. After some reading, Pearson's correlation coefficient seems like the correct thing to use here. Note that the interpretation uses different cutoffs for weak/moderate/strong trends than this specific source, because there seems to be some disagreement about the cutoffs, but I don't think it matters that much in the grand scheme of things. There are some very minor, semi-subjective magic numbers to consider, like p-values and minimum sample sizes, but again, anything reasonable is probably fine.
  • Do math. Because it's a bit easier to understand at first glance, the script puts memory consumption/availability averages into uptime buckets on a per-process basis. Note that the script's implementation of Pearson's does NOT compute correlations over the buckets of averages; we actually correlate each raw memory usage data point against uptime. (See the sketch after this list.)
  • When doing statistics and math, we aggregate certain processes (AGGREGATED_PROCESSES), since, for example, electron's renderer process can include a lot of unique flags that create unnecessarily disparate results when taken individually. Similarly, there are a lot of processes we don't care to analyze, such as python3 processes, so these are added to BLACKLISTED_PROCESSES and not included in the final results.
  • We report these results.
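
Roughly, the statistics and math steps look like the sketch below. The identifiers and data shape are simplified for illustration, and the p-value and minimum-sample-size checks are omitted; the bucket boundaries match the sample output further down.

```javascript
// Illustrative sketch only; real identifiers differ, and p-value /
// minimum-sample-size checks are omitted.
const UPTIME_BUCKET_FLOORS_HRS = [0, 20, 40, 60, 80, 120, 240] // last is 240+

// Pearson's r over the raw (uptimeHrs, memoryMb) pairs: one point per
// sample, NOT one point per bucket average.
function pearson(xs, ys) {
  const mean = arr => arr.reduce((acc, v) => acc + v, 0) / arr.length
  const mx = mean(xs)
  const my = mean(ys)
  let num = 0
  let dx2 = 0
  let dy2 = 0
  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - mx
    const dy = ys[i] - my
    num += dx * dy
    dx2 += dx * dx
    dy2 += dy * dy
  }
  return num / Math.sqrt(dx2 * dy2)
}

// Bucketed averages are reported separately, purely for readability; they
// play no part in the correlation above.
function bucketAverages(samples /* [{ uptimeHrs, memoryMb }] */) {
  const buckets = UPTIME_BUCKET_FLOORS_HRS.map(() => [])
  for (const { uptimeHrs, memoryMb } of samples) {
    const i = UPTIME_BUCKET_FLOORS_HRS.findLastIndex(floor => uptimeHrs >= floor)
    if (i >= 0) buckets[i].push(memoryMb)
  }
  return buckets.map(bucket =>
    bucket.length > 0
      ? (bucket.reduce((acc, v) => acc + v, 0) / bucket.length).toFixed(2)
      : 'N/A - Not enough data available.'
  )
}
```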

Some implementation notes

I decided against parameterizing a couple of things that could debatably be parameterized, notably the number of past valid builds to report on and the timeframe to analyze (we always look at the past month of available data), because I don't think it matters that much. Once a build is built, we don't expect to see any real change in its memory performance month-to-month, and if users really want to compare against previous builds with resource monitor data, it's probably better to look at previous CI runs than to rapidly query the Mixpanel API.

Sample output

ODD Available Memory and Processes with Increasing Memory Trend by Version (Rolling 1 Month Analysis Window):

8.2.0-alpha.2: "N/A - No data available"

8.2.0-alpha.1: {
  "robot-server-uvicorn-processes": {
    "correlation": "0.2898",
    "sampleSize": 499,
    "interpretation": "Weak positive correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "187.94",
      "20-40hrs": "199.77",
      "40-60hrs": "N/A - Not enough data available.",
      "60-80hrs": "N/A - Not enough data available.",
      "80-120hrs": "N/A - Not enough data available.",
      "120-240hrs": "N/A - Not enough data available.",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "app-renderer-processes": {
    "correlation": "0.1756",
    "sampleSize": 561,
    "interpretation": "Weak positive correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "136.10",
      "20-40hrs": "151.55",
      "40-60hrs": "N/A - Not enough data available.",
      "60-80hrs": "N/A - Not enough data available.",
      "80-120hrs": "N/A - Not enough data available.",
      "120-240hrs": "N/A - Not enough data available.",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "odd-available-memory": {
    "correlation": "-0.2863",
    "sampleSize": 513,
    "interpretation": "Weak negative correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "1127.02",
      "20-40hrs": "1108.41",
      "40-60hrs": "N/A - Not enough data available.",
      "60-80hrs": "N/A - Not enough data available.",
      "80-120hrs": "N/A - Not enough data available.",
      "120-240hrs": "N/A - Not enough data available.",
      "240+hrs": "N/A - Not enough data available."
    }
  }
}

8.2.0-alpha.0: {
  "robot-server-uvicorn-processes": {
    "correlation": "0.1196",
    "sampleSize": 1643,
    "interpretation": "Weak positive correlation (>=0.3)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "198.24",
      "20-40hrs": "222.17",
      "40-60hrs": "231.81",
      "60-80hrs": "234.99",
      "80-120hrs": "223.96",
      "120-240hrs": "222.31",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "app-renderer-processes": {
    "correlation": "0.6037",
    "sampleSize": 1851,
    "interpretation": "Moderate positive correlation (>0.3 and <0.7)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "161.67",
      "20-40hrs": "183.64",
      "40-60hrs": "200.20",
      "60-80hrs": "249.22",
      "80-120hrs": "309.07",
      "120-240hrs": "555.83",
      "240+hrs": "N/A - Not enough data available."
    }
  },
  "odd-available-memory": {
    "correlation": "-0.7721",
    "sampleSize": 1725,
    "interpretation": "Strong negative correlation (>0.7)",
    "averageMemoryMbByUptime": {
      "0-20hrs": "1083.79",
      "20-40hrs": "1028.03",
      "40-60hrs": "986.28",
      "60-80hrs": "916.17",
      "80-120hrs": "848.11",
      "120-240hrs": "613.95",
      "240+hrs": "N/A - Not enough data available."
    }
  }
}

Test Plan and Hands-on Testing

  • Ran the script manually a bunch of times. It does what it should do.
  • See review requests

Changelog

  • Added automated analysis of monitored Flex resource data.

Review requests

  • So it seems like the only real way to test a GitHub Action is by merging it? If so, does the .yaml look like it might work?
  • The script gets the app manifest by querying a hard-coded URL. Is there a preferable way to do this or are we ok doing that?
  • I decided not to make this a pass/fail action, since I think it's important to contextualize results, view them on a semi-regular cadence, and IMO, it's misleading to say "we don't have a memory leak until correlation X is reached." Since this action runs on a cronjob and isn't directly tied to a PR, I don't think pass/fail matters much anyway. All this being said, I'm more than happy to tie this to pass/fail if others feel differently.

Risk assessment

Low

@mjhuff mjhuff requested a review from sfoster1 November 15, 2024 15:09
@mjhuff mjhuff requested review from a team as code owners November 15, 2024 15:09
@mjhuff mjhuff requested review from jerader and a team and removed request for a team and jerader November 15, 2024 15:09
```yaml
on:
  schedule:
    - cron: '30 12 * * *'
  workflow_dispatch:
```
@mjhuff (Contributor, Author) commented:

I believe this line lets us manually run an action on-the-fly from the Actions tab? Is that true?

A member replied:

That's correct, yeah. But it may have to be merged to the default branch first so it actually shows up in the actions tab.

@sfoster1 (Member) left a comment:

I think this is complex enough that it should just be a javascript action with its own deps that lives in .github/actions/odd-memory-stats or something, but it does look great!

```yaml
          MIXPANEL_INGEST_SECRET: ${{ secrets.MIXPANEL_INGEST_SECRET }}
          MIXPANEL_PROJECT_ID: ${{ secrets.OT_APP_MIXPANEL_ID }}
        run: |
          OUTPUT=$(node ./scripts/resource-monitor/perform-memory-analysis "$MIXPANEL_INGEST_USER" "$MIXPANEL_INGEST_SECRET" "$MIXPANEL_PROJECT_ID")
```
A member commented:

these don't have to be passed through the environment; you can have this be $(node ./scripts/resource-monitor/perform-memory-analysis ${{ secrets.MIXPANEL_INGEST_USER }} for instance.

even better, and more on this later, is to just not put this through a shell action at all and implement this as a node action and use the api that it gives you, and then you don't have to deal with any of this at all
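
For illustration, a minimal sketch of what the suggested node-action entry point could look like using the @actions/core toolkit; the file path, input names, and the performMemoryAnalysis export are hypothetical:

```javascript
// .github/actions/odd-memory-stats/index.js (hypothetical path and names).
// Inputs arrive via the actions toolkit rather than a shell command line,
// so secrets never have to be interpolated into a run: step.
const core = require('@actions/core')
// Assumed export; the PR's script is invoked via CLI args today.
const { performMemoryAnalysis } = require('./perform-memory-analysis')

async function run() {
  try {
    const user = core.getInput('mixpanel-ingest-user', { required: true })
    const secret = core.getInput('mixpanel-ingest-secret', { required: true })
    const projectId = core.getInput('mixpanel-project-id', { required: true })
    const results = await performMemoryAnalysis(user, secret, projectId)
    core.setOutput('analysis', JSON.stringify(results))
  } catch (error) {
    core.setFailed(error.message)
  }
}

run()
```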

@@ -0,0 +1,84 @@

```javascript
// Analysis is based on a one-tailed Pearson's correlation coefficient.
```
A member commented:

if you implement this as a javascript action, you actually bundle locally from a separate deps list (see e.g. the helper action in oe-core; this is a rollup output plus source pair) and you can use a stats package if you want.
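
For instance, a minimal rollup.config.js along those lines; the exact plugin set the oe-core helper uses is an assumption here:

```javascript
// rollup.config.js: bundle the action and its deps into one dist/index.js
// so the runner needs no npm install. Plugin choice is an assumption.
import resolve from '@rollup/plugin-node-resolve'
import commonjs from '@rollup/plugin-commonjs'
import json from '@rollup/plugin-json'

export default {
  input: 'src/index.js',
  output: { file: 'dist/index.js', format: 'cjs' },
  plugins: [resolve({ preferBuiltins: true }), commonjs(), json()],
}
```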

```javascript
const formatDate = date => date.toISOString().split('T')[0]

const monthAgo = new Date(yesterday)
monthAgo.setMonth(yesterday.getMonth() - 1)
```
A member commented:

does this actually work with the web-API Date, instead of something like Arrow? Like, what happens if you do this on October 31, or I guess on November 1?
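
For reference, plain Date handles this by rolling over rather than clamping; a quick demonstration, using the UTC accessors so the printed date is deterministic:

```javascript
// Plain Date does not clamp when the target month is shorter: it rolls over.
const yesterday = new Date(Date.UTC(2024, 9, 31)) // October 31, 2024 (UTC)
const monthAgo = new Date(yesterday)
monthAgo.setUTCMonth(yesterday.getUTCMonth() - 1) // "September 31" rolls to October 1
console.log(monthAgo.toISOString().split('T')[0]) // "2024-10-01", not "2024-09-30"
```

So around short months the window start drifts by a day rather than erroring, which is probably tolerable for a rolling one-month analysis, but worth knowing.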

@@ -0,0 +1,52 @@

```javascript
const fetch = require('node-fetch')

const APP_MANIFEST = 'https://builds.opentrons.com/app/releases.json'
```
A member commented:

since this is ODD data, and ODD versions are actually robot versions, it might make sense to do this from https://builds.opentrons.com/ot3-oe/releases.json
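
A sketch of that swap; note the manifest's JSON shape isn't shown in this PR, so the key access below is a placeholder assumption:

```javascript
const fetch = require('node-fetch')

// Suggested robot/ODD manifest instead of the app manifest.
const ODD_MANIFEST = 'https://builds.opentrons.com/ot3-oe/releases.json'

async function getRecentBuilds(pastBuildCount = 2) {
  const response = await fetch(ODD_MANIFEST)
  if (!response.ok) {
    throw new Error(`Manifest fetch failed: ${response.status}`)
  }
  const manifest = await response.json()
  // Placeholder: assumes version strings appear as keys in the manifest.
  return Object.keys(manifest).slice(-(pastBuildCount + 1))
}
```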
