[BUGFIX] Respect language based style names on reading Word files #2597

Open: wants to merge 1 commit into base: master

Commits on Apr 2, 2024

  1. [BUGFIX] Respect language based style names on reading Word files

    Microsoft Office saves Office documents with language-based style
    mappings for default styles. For example, a German-language version
    of Word writes the following to `word/styles.xml` in the container
    archive (*.docx):
    
    ```
    <w:style w:type="paragraph" w:styleId="berschrift1">
      <w:name w:val="heading 1"/>
      ...
    </w:style>
    ```
    
    whereas an English-language version would write:
    
    ```
    <w:style w:type="paragraph" w:styleId="Heading1">
      <w:name w:val="heading 1"/>
      ...
    </w:style>
    ```
    
    The value of `<w:name />` defines the internal, native
    identifier, whereas the `w:styleId` attribute on the enclosing
    `<w:style />` tag describes the virtual or alias name.
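
To illustrate the relation between `w:styleId` and `<w:name />`, here is a minimal sketch in Python (not the PHP code of this change) that collects the alias of every style from a `word/styles.xml` fragment. The function name `collect_style_aliases` and the sample document are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Sample fragment modelled on the German example above (illustrative only).
STYLES_XML = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="berschrift1">
    <w:name w:val="heading 1"/>
  </w:style>
</w:styles>"""


def collect_style_aliases(styles_xml: str) -> dict:
    """Map each w:styleId (alias) to its w:name value (native identifier)."""
    root = ET.fromstring(styles_xml)
    aliases = {}
    for style in root.iter(f"{{{W_NS}}}style"):
        style_id = style.get(f"{{{W_NS}}}styleId")
        name = style.find(f"{{{W_NS}}}name")
        if style_id is None or name is None:
            continue  # skip styles that lack either piece of information
        aliases[style_id] = name.get(f"{{{W_NS}}}val")
    return aliases


aliases = collect_style_aliases(STYLES_XML)
print(aliases)
```

With the German fragment, the localized alias `berschrift1` maps back to the native name `heading 1`.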
    
    Later, when the document structure is parsed (for example the
    paragraphs), a style is referenced by its alias (`w:styleId`).
    The reader code applies hardcoded, case-insensitive RegEx
    matches written for the English variant (`Header\s+d`) to
    this language-based alias, so they do not match at all.
    
    Therefore, this change contains several steps:
    
    * An alias map is implemented and used to register title
      aliases, along with a corresponding lookup method.
    * Use the lookup method to resolve the alias wherever the
      hardcoded English RegEx is applied.
    * Gather the alias names of all styles while reading the
      style settings of the Word file.
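
The alias map and its lookup method can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the PHP implementation of this change: the class `StyleAliasMap`, the function `heading_depth`, and the pattern `heading\s*(\d+)` are all hypothetical stand-ins for the reader's actual code and RegEx.

```python
import re


class StyleAliasMap:
    """Hypothetical registry mapping localized style aliases to native names."""

    def __init__(self):
        self._native_by_alias = {}

    def register(self, alias: str, native_name: str) -> None:
        # Store aliases lower-cased so the lookup is case-insensitive.
        self._native_by_alias[alias.lower()] = native_name

    def resolve(self, alias: str) -> str:
        # Unknown aliases fall back to themselves, so English documents
        # without a registered alias behave as before.
        return self._native_by_alias.get(alias.lower(), alias)


# Assumed English-only pattern; the real reader's RegEx may differ.
HEADING_RE = re.compile(r"heading\s*(\d+)", re.IGNORECASE)


def heading_depth(style_id: str, alias_map: StyleAliasMap):
    """Return the heading level for a style reference, or None if not a heading."""
    match = HEADING_RE.fullmatch(alias_map.resolve(style_id))
    return int(match.group(1)) if match else None


alias_map = StyleAliasMap()
alias_map.register("berschrift1", "heading 1")  # gathered while reading styles

print(heading_depth("berschrift1", alias_map))
print(heading_depth("Heading1", alias_map))
print(heading_depth("Normal", alias_map))
```

Resolving the alias first means the English pattern only ever sees the native name, so no per-language RegEx is needed.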
    sbuerk committed Apr 2, 2024
    13a5d65