Rather than assuming a specific type of XML file structure (e.g. a TCX Activity), let's just read in whatever elements happen to be in the file; we can figure out what to do with the data later. This more closely aligns with the way .fit files are read in by fitparse (although fitparse relies on a schema-like profile file).
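The schema-agnostic approach can be sketched with the stdlib's ElementTree; the sample document and tag names below are illustrative stand-ins, not the real TCX schema:

```python
# Sketch: read whatever elements an XML file happens to contain, with no
# assumed schema. Each element becomes a flat record of tag/text/attributes.
import xml.etree.ElementTree as ET
from io import StringIO

xml_data = """<Activities>
  <Activity Sport="Running">
    <Trackpoint><Time>2023-01-01T00:00:00Z</Time></Trackpoint>
  </Activity>
</Activities>"""

records = []
for event, elem in ET.iterparse(StringIO(xml_data), events=("end",)):
    # Strip any namespace prefix ({uri}local -> local), keep text and attributes.
    tag = elem.tag.rsplit("}", 1)[-1]
    records.append({"tag": tag, "text": (elem.text or "").strip(), **elem.attrib})

print(records)
```

With "end" events, elements arrive in the order their closing tags appear, so leaf elements come before their parents; what to do with each record can be decided afterwards, much like iterating messages from fitparse.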
Ideas
names : list-like, optional
Column names for DataFrame of parsed XML data. Use this parameter to
rename original element names and distinguish same named elements and
attributes.
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,
'c': 'Int64'}
Use `str` or `object` together with suitable `na_values` settings
to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD
of dtype conversion.
.. versionadded:: 1.5.0
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels.
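A minimal sketch of the names and dtype options quoted above, run against a tiny stand-in document (the element names are made up, not real TCX tags):

```python
# Rename parsed columns with `names` and force a dtype; a converters dict
# could be passed instead of dtype for per-column conversion functions.
from io import StringIO
import pandas as pd

xml_data = """<rows>
  <row><lat>43.70</lat><lon>-79.40</lon></row>
  <row><lat>43.71</lat><lon>-79.41</lon></row>
</rows>"""

df = pd.read_xml(
    StringIO(xml_data),
    xpath=".//row",
    names=["latitude", "longitude"],  # rename the original element names
    dtype=float,                      # one type for all columns; a dict also works
)
print(df)
```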
A completely different, but intuitive and straightforward, way to flatten row elements:
iterparse : dict, optional
The nodes or attributes to retrieve in iterparsing of XML document
as a dict with key being the name of repeating element and value being
list of elements or attribute names that are descendants of the repeated
element. Note: If this option is used, it will replace ``xpath`` parsing
and unlike xpath, descendants do not need to relate to each other but can
exist any where in document under the repeating element. This memory-
efficient method should be used for very large XML files (500MB, 1GB, or 5GB+).
For example, ::
iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
.. versionadded:: 1.5.0
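A sketch of the iterparse option on a minimal TCX-like file; the repeating element ("Trackpoint"), its descendants, and the attribute are simplified stand-ins, not the full TCX schema:

```python
# iterparse reads from a file on disk (that is the point: the whole document
# never has to fit in memory), so write the sample out first.
import os
import tempfile
import pandas as pd

xml_data = """<Activities>
  <Trackpoint id="1"><Time>2023-01-01T00:00:00Z</Time><HeartRateBpm>140</HeartRateBpm></Trackpoint>
  <Trackpoint id="2"><Time>2023-01-01T00:00:01Z</Time><HeartRateBpm>141</HeartRateBpm></Trackpoint>
</Activities>"""

with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write(xml_data)
    path = f.name

# One row per repeating element; descendants and attributes become columns.
df = pd.read_xml(path, iterparse={"Trackpoint": ["Time", "HeartRateBpm", "id"]})
os.unlink(path)
print(df)
```

Note that iterparse replaces xpath parsing entirely, so no xpath argument is needed here.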
Progress
Update pandas requirement:
- 1.5: read_xml got a parse_dates option that I use.
- 1.3: read_xml was added.
xslt transform result (flattened xml file)
When I check the result of the XSLT transform, the new document has namespace issues. (I don't think this matters for pandas, but it's something to keep an eye on.)
It probably has to do with the namespaces in the generated XSLT script: there is no default namespace IIRC, just an xmlns:default declaration. The generated script then has elements like PositionLatitudeDegrees with no namespace specified.
generated xslt script
Currently, a separate xsl:template element is created for each sub-field of a multi-field ("mega") field:
...which results in only one field getting copied over: there are two templates matching the "Extensions" tag, and only one of them gets applied. So right now I have the nesting covered, but I still need to figure out the expansion. The template should ultimately accomplish this instead (one match statement per megafield, then one select statement per sub-field):
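A hand-written sketch of that target shape, assuming a hypothetical "Extensions" megafield with two sub-fields named Speed and RunCadence (illustrative names, not the generator's actual output):

```xml
<!-- One xsl:template per megafield, with one flattening value-of select
     per sub-field, instead of one competing template per sub-field. -->
<xsl:template match="Extensions">
  <ExtensionsSpeed><xsl:value-of select=".//Speed"/></ExtensionsSpeed>
  <ExtensionsRunCadence><xsl:value-of select=".//RunCadence"/></ExtensionsRunCadence>
</xsl:template>
```

Because only one template now matches "Extensions", XSLT's conflict-resolution rules no longer drop one of the sub-fields.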