Encapsulate general xml file parsing #22

aaron-schroeder · 2023-02-16T19:35:38Z

Rather than assuming a specific XML file structure (e.g. a TCX Activity), let's just read in whatever elements happen to be in the file and figure out what to do with the data later. This more closely matches the way .fit files are read in with fitparse (although that relies on a schema-like profile file).
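As a rough illustration of "read in whatever elements happen to be in the file", here is a minimal standard-library sketch. It is not the project's implementation: the helper names (`local_name`, `flatten`, `read_records`) and the sample document are made up for this example, and the choice to build column names by joining nested tag names is an assumption that happens to mirror names like `PositionLatitudeDegrees` used later in this issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down TCX-like document used only for illustration.
SAMPLE = """\
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
    xmlns:ns3="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
  <Activities><Activity><Lap><Track>
    <Trackpoint>
      <Time>2023-02-16T19:35:38Z</Time>
      <Position><LatitudeDegrees>40.0</LatitudeDegrees></Position>
      <Extensions><ns3:TPX><ns3:Speed>2.5</ns3:Speed></ns3:TPX></Extensions>
    </Trackpoint>
  </Track></Lap></Activity></Activities>
</TrainingCenterDatabase>
"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, prefix=""):
    """Return {joined_tag_names: text} for every leaf element under elem."""
    record = {}
    for child in elem:
        name = prefix + local_name(child.tag)
        if len(child):
            # The child has children of its own: recurse and extend the name.
            record.update(flatten(child, name))
        else:
            record[name] = (child.text or "").strip()
    return record


def read_records(xml_text, row_tag):
    """Flatten every element named row_tag, wherever it sits in the tree."""
    root = ET.fromstring(xml_text)
    return [flatten(e) for e in root.iter() if local_name(e.tag) == row_tag]


records = read_records(SAMPLE, "Trackpoint")
print(records)
```

Nothing here knows about TCX specifically; the only input besides the file is the name of the repeating element, and the resulting list of dicts could be handed to a DataFrame constructor afterwards.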

Ideas

names :  list-like, optional
    Column names for DataFrame of parsed XML data. Use this parameter to
    rename original element names and distinguish same named elements and
    attributes.

dtype : Type name or dict of column -> type, optional
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,
    'c': 'Int64'}
    Use `str` or `object` together with suitable `na_values` settings
    to preserve and not interpret dtype.
    If converters are specified, they will be applied INSTEAD
    of dtype conversion.
    .. versionadded:: 1.5.0

converters : dict, optional
    Dict of functions for converting values in certain columns. Keys can either
    be integers or column labels.

A completely different, but intuitive and straightforward, way to flatten row elements:

iterparse : dict, optional
    The nodes or attributes to retrieve in iterparsing of XML document
    as a dict with key being the name of repeating element and value being
    list of elements or attribute names that are descendants of the repeated
    element. Note: If this option is used, it will replace ``xpath`` parsing
    and unlike xpath, descendants do not need to relate to each other but can
    exist any where in document under the repeating element. This memory-
    efficient method should be used for very large XML files (500MB, 1GB, or 5GB+).
    For example, ::
        iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
    .. versionadded:: 1.5.0
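To make the `iterparse` idea concrete, the following is a minimal sketch of the same flattening approach using only `xml.etree.ElementTree.iterparse`. It is not pandas' implementation; `iterparse_rows`, the sample document, and the spec are hypothetical and only mimic the shape of the `iterparse` argument quoted above (one repeating element mapped to a list of descendant or attribute names).

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: descendants sit at different depths under each row.
SAMPLE = b"""\
<Track>
  <Trackpoint id="1">
    <Time>t1</Time>
    <Position><LatitudeDegrees>40.0</LatitudeDegrees></Position>
  </Trackpoint>
  <Trackpoint id="2">
    <Time>t2</Time>
    <Extensions><TPX><Speed>2.5</Speed></TPX></Extensions>
  </Trackpoint>
</Track>
"""

SPEC = {"Trackpoint": ["Time", "LatitudeDegrees", "Speed", "id"]}


def iterparse_rows(source, spec):
    """Stream source and yield one dict per repeating element in spec."""
    row_tag, wanted = next(iter(spec.items()))
    row = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = elem.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
        if event == "start":
            if name == row_tag:
                row = {}  # a new row begins
            continue
        if row is None:
            continue  # "end" of an element outside any row
        # "end" event inside a row: the element's text is complete here.
        if name in wanted and name not in row:
            row[name] = (elem.text or "").strip()
        for attr, value in elem.attrib.items():
            if attr in wanted and attr not in row:
                row[attr] = value
        if name == row_tag:
            yield row
            row = None
            elem.clear()  # release the finished row's subtree


rows = list(iterparse_rows(io.BytesIO(SAMPLE), SPEC))
print(rows)
```

As in the docstring, the listed names do not need to be siblings or share a path; any descendant (or attribute) with a matching name under the repeating element is picked up, and rows that lack a name simply omit that key.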

Progress

Update pandas requirement:
- 1.5: read_xml got a parse_dates option that I use.
- 1.3: read_xml was added.

xslt transform result (flattened xml file)

When I check the result of the xslt transform, the new document has namespace issues. (I don't think this matters for pandas, but it's something to keep an eye on.)

<PositionLatitudeDegrees xmlns="" xmlns:default="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"><value-of select="./default:LatitudeDegrees"/></PositionLatitudeDegrees>

This probably has to do with the namespaces in the generated xslt script (there is no default IIRC, just a namespace xmlns:default). The generated script then has elements like PositionLatitudeDegrees with no namespace specified.
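One quick way to see how a namespace-aware parser reads the transformed element is to parse the snippet above with the standard library. This is only a sanity check under the assumption that the snippet is representative of the transform output; it says nothing definitive about how pandas treats the full document.

```python
import xml.etree.ElementTree as ET

# The element from the transform result, copied from the snippet above.
SNIPPET = (
    '<PositionLatitudeDegrees xmlns="" '
    'xmlns:default="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">'
    '<value-of select="./default:LatitudeDegrees"/>'
    "</PositionLatitudeDegrees>"
)

elem = ET.fromstring(SNIPPET)

# ElementTree writes namespaced tags as '{uri}name'; bare names mean
# the element is in no namespace.
tags = [e.tag for e in elem.iter()]
print(tags)
```

Both tags come back without a `{uri}` prefix, so the `xmlns=""` declaration just puts these elements in no namespace. The child is also a bare `value-of` rather than an element in the XSLT namespace, which may be why it shows up in the output instead of being evaluated.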

generated xslt script

Currently, a separate xsl:template element is created for each sub-field of a multi-field element:

  <xsl:template match="/default:TrainingCenterDatabase/default:Activities/default:Activity/default:Lap/default:Track/default:Trackpoint/default:Extensions">
    <ExtensionsTPXRunCadence>
      <value-of select="./ns3:TPX/ns3:RunCadence"/>
    </ExtensionsTPXRunCadence>
  </xsl:template>
  <xsl:template match="/default:TrainingCenterDatabase/default:Activities/default:Activity/default:Lap/default:Track/default:Trackpoint/default:Extensions">
    <ExtensionsTPXSpeed>
      <value-of select="./ns3:TPX/ns3:Speed"/>
    </ExtensionsTPXSpeed>
  </xsl:template>

...which results in only one field getting copied over - there are two templates matching the "Extensions" tag, and only one template gets applied. So right now I have the nesting covered, but need to figure out the expansion. The template should ultimately accomplish this instead (one match statement per megafield, then one select statement per sub-field):

<xsl:template match="//ns1:Trackpoint/ns1:Extensions">
  <ExtensionsTPXSpeed>
    <xsl:value-of select="./ns3:TPX/ns3:Speed"/>
  </ExtensionsTPXSpeed>
  <ExtensionsTPXRunCadence>
    <xsl:value-of select="./ns3:TPX/ns3:RunCadence"/>
  </ExtensionsTPXRunCadence>
</xsl:template>
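The "one match per megafield, one select per sub-field" rule could be sketched in the generator roughly as follows. This is an illustration only: `build_templates` and the `(match, name, select)` tuple layout are hypothetical and not taken from the existing code, and the `ns1`/`ns3` prefixes are assumed to be declared on the enclosing `xsl:stylesheet` element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL_NS)  # serialize with the usual xsl: prefix

# Hypothetical field list: (match path, output element name, select path).
FIELDS = [
    ("//ns1:Trackpoint/ns1:Extensions", "ExtensionsTPXSpeed",
     "./ns3:TPX/ns3:Speed"),
    ("//ns1:Trackpoint/ns1:Extensions", "ExtensionsTPXRunCadence",
     "./ns3:TPX/ns3:RunCadence"),
]


def build_templates(fields):
    """Return one xsl:template element per distinct match path."""
    grouped = defaultdict(list)
    for match, name, select in fields:
        grouped[match].append((name, select))

    templates = []
    for match, subfields in grouped.items():
        template = ET.Element(f"{{{XSL_NS}}}template", {"match": match})
        for name, select in subfields:
            # One output element and one xsl:value-of per sub-field.
            out = ET.SubElement(template, name)
            ET.SubElement(out, f"{{{XSL_NS}}}value-of", {"select": select})
        templates.append(template)
    return templates


templates = build_templates(FIELDS)
for template in templates:
    print(ET.tostring(template, encoding="unicode"))
```

Grouping by match path before emitting anything means two sub-fields under `Extensions` can never produce two competing templates, which is the failure described above.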

aaron-schroeder added the enhancement (New feature or request) and Refactor (Internal refactoring of code) labels Feb 16, 2023

aaron-schroeder changed the title ~~Add generic xsd file parsing~~ Encapsulate general xml file parsing Feb 19, 2023
