Encapsulate general xml file parsing #22

Open
2 of 3 tasks
aaron-schroeder opened this issue Feb 16, 2023 · 0 comments
Labels: enhancement (New feature or request), Refactor (Internal refactoring of code)

Rather than assuming a specific XML file structure (e.g., a TCX Activity), let's just read in whatever elements happen to be in the file and figure out what to do with the data later. This more closely matches the way .fit files are read in with fitparse (although that relies on a schema-like profile file).
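As a rough illustration of "read in whatever elements happen to be in the file", here is a minimal standard-library sketch. It flattens every leaf under a repeating element into a dict, joining nested tag names into a single column name. The helper names (`local_name`, `flatten`) and the sample document are made up for illustration; this is not the project's code.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, prefix=""):
    """Return {joined_tag_names: text} for every leaf element under `elem`."""
    record = {}
    for child in elem:
        name = prefix + local_name(child.tag)
        if len(child):  # has children: recurse, extending the column name
            record.update(flatten(child, name))
        else:
            record[name] = (child.text or "").strip()
    return record


# Hypothetical sample; only the namespace URI comes from the issue text.
SAMPLE = """\
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Time>2023-02-16T12:00:00Z</Time>
    <Position>
      <LatitudeDegrees>40.0</LatitudeDegrees>
      <LongitudeDegrees>-105.2</LongitudeDegrees>
    </Position>
  </Trackpoint>
  <Trackpoint>
    <Time>2023-02-16T12:00:01Z</Time>
    <HeartRateBpm><Value>120</Value></HeartRateBpm>
  </Trackpoint>
</TrainingCenterDatabase>"""

root = ET.fromstring(SAMPLE)
# No assumption about which children a Trackpoint has: each row keeps what it finds.
rows = [flatten(el) for el in root.iter() if local_name(el.tag) == "Trackpoint"]
```

Rows can end up with different keys, so whatever consumes them (a DataFrame constructor, for instance) has to tolerate missing columns.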

Ideas

names :  list-like, optional
    Column names for DataFrame of parsed XML data. Use this parameter to
    rename original element names and distinguish same named elements and
    attributes.

dtype : Type name or dict of column -> type, optional
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,
    'c': 'Int64'}
    Use `str` or `object` together with suitable `na_values` settings
    to preserve and not interpret dtype.
    If converters are specified, they will be applied INSTEAD
    of dtype conversion.

    .. versionadded:: 1.5.0

converters : dict, optional
    Dict of functions for converting values in certain columns. Keys can either
    be integers or column labels.

A completely different, but intuitive and straightforward, way to flatten row elements:

iterparse : dict, optional
    The nodes or attributes to retrieve in iterparsing of XML document
    as a dict with key being the name of repeating element and value being
    list of elements or attribute names that are descendants of the repeated
    element. Note: If this option is used, it will replace ``xpath`` parsing
    and unlike xpath, descendants do not need to relate to each other but can
    exist any where in document under the repeating element. This memory-
    efficient method should be used for very large XML files (500MB, 1GB, or 5GB+).
    For example, ::

        iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}

    .. versionadded:: 1.5.0
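To make the idea behind that option concrete, here is a small sketch of the same streaming approach using only the standard library's `xml.etree.ElementTree.iterparse`. It takes a dict shaped like the docstring's example and yields one flat row per repeating element. This is an assumption-laden analogue, not pandas' implementation; `iter_rows` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_rows(source, spec):
    """Yield one dict per repeating element named by the single key of `spec`.

    `spec` maps the repeating element name to the descendant element or
    attribute names to collect, anywhere under that element.
    """
    (row_tag, wanted), = spec.items()
    wanted = set(wanted)
    row = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            if name == row_tag:
                row = {}  # a new row begins
        elif row is not None:
            # Collect wanted attributes from this element (row or descendant).
            for key, value in elem.attrib.items():
                if key in wanted:
                    row.setdefault(key, value)
            # Collect text of wanted descendant elements.
            if name in wanted:
                row.setdefault(name, (elem.text or "").strip())
            if name == row_tag:
                yield row
                row = None
                elem.clear()  # drop the finished subtree's contents


# Hypothetical document using the element names from the docstring example.
SAMPLE = (
    b'<doc>'
    b'<row_element attr="a1"><child_elem>1</child_elem>'
    b'<wrapper><grandchild_elem>x</grandchild_elem></wrapper></row_element>'
    b'<row_element attr="a2"><child_elem>2</child_elem></row_element>'
    b'</doc>'
)

spec = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
rows = list(iter_rows(io.BytesIO(SAMPLE), spec))
```

As in the docstring's description, `grandchild_elem` is picked up even though it sits under an intermediate `wrapper` element, since descendants only need to live somewhere under the repeating element.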

Progress

  • Update pandas requirement:

    • 1.5: read_xml gained the parse_dates option, which I use.
    • 1.3: read_xml was added.
  • xslt transform result (flattened xml file)

When I check the result of the xslt transform, the new document has namespace issues. (I don't think this matters for pandas, but something to keep an eye on).

<PositionLatitudeDegrees xmlns="" xmlns:default="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"><value-of select="./default:LatitudeDegrees"/></PositionLatitudeDegrees>

This probably has to do with the namespaces in the generated xslt script (there is no default namespace IIRC, just a prefixed one, xmlns:default). On top of that, the generated script has elements like PositionLatitudeDegrees with no namespace specified.

  • generated xslt script
  • Currently, a separate xsl:template element is created for each sub-field of a multi-field element:
  <xsl:template match="/default:TrainingCenterDatabase/default:Activities/default:Activity/default:Lap/default:Track/default:Trackpoint/default:Extensions">
    <ExtensionsTPXRunCadence>
      <value-of select="./ns3:TPX/ns3:RunCadence"/>
    </ExtensionsTPXRunCadence>
  </xsl:template>
  <xsl:template match="/default:TrainingCenterDatabase/default:Activities/default:Activity/default:Lap/default:Track/default:Trackpoint/default:Extensions">
    <ExtensionsTPXSpeed>
      <value-of select="./ns3:TPX/ns3:Speed"/>
    </ExtensionsTPXSpeed>
  </xsl:template>

...which results in only one field getting copied over: two templates match the "Extensions" tag, and only one of them gets applied. So right now I have the nesting covered, but need to figure out the expansion. The template should ultimately accomplish this instead (one match statement per megafield, then one select statement per sub-field):

<xsl:template match="//ns1:Trackpoint/ns1:Extensions">
  <ExtensionsTPXSpeed>
    <xsl:value-of select="./ns3:TPX/ns3:Speed"/>
  </ExtensionsTPXSpeed>
  <ExtensionsTPXRunCadence>
    <xsl:value-of select="./ns3:TPX/ns3:RunCadence"/>
  </ExtensionsTPXRunCadence>
</xsl:template>
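One way the generator could get there is to group the sub-fields by their match path before emitting anything, so each path produces exactly one template. The sketch below builds the template text with plain string formatting; `build_templates` and the `(match, output_name, select)` tuple layout are assumptions made for illustration, not the existing generator's interface.

```python
from collections import defaultdict


def build_templates(fields):
    """Emit one xsl:template per match path, with one value-of per sub-field.

    `fields` is an iterable of (match_path, output_name, select_path) tuples.
    """
    grouped = defaultdict(list)
    for match, name, select in fields:
        grouped[match].append((name, select))

    blocks = []
    for match, subfields in grouped.items():
        lines = [f'<xsl:template match="{match}">']
        for name, select in subfields:
            lines.append(f"  <{name}>")
            lines.append(f'    <xsl:value-of select="{select}"/>')
            lines.append(f"  </{name}>")
        lines.append("</xsl:template>")
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


# The two sub-fields from the example above, sharing one match path.
FIELDS = [
    ("//ns1:Trackpoint/ns1:Extensions", "ExtensionsTPXSpeed", "./ns3:TPX/ns3:Speed"),
    ("//ns1:Trackpoint/ns1:Extensions", "ExtensionsTPXRunCadence", "./ns3:TPX/ns3:RunCadence"),
]

xslt = build_templates(FIELDS)
print(xslt)
```

Because both tuples share the same match path, the output contains a single template with two output elements, which sidesteps the two-templates-one-match conflict.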

aaron-schroeder added the enhancement and Refactor labels on Feb 16, 2023.
aaron-schroeder changed the title from "Add generic xsd file parsing" to "Encapsulate general xml file parsing" on Feb 19, 2023.