Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wikipedia-kyoto-japanese-english: increase REXML entity expansion limit during XML parsing #198

Merged

Conversation

otegami
Copy link
Member

@otegami otegami commented Aug 1, 2024

Using Datasets::WikipediaKyotoJapaneseEnglish#each raised an entity expansion has grown too large (RuntimeError). This error occurs because the entity expansion limit in REXML is set by ruby/rexml#187, and Datasets::WikipediaKyotoJapaneseEnglish#each exceeds that limit.

In Red Datasets, increasing the entity expansion limit is not a problem because we want to handle large datasets.
Therefore, we temporarily increase the limit.

How to reproduce

$ cd red-datasets && bundle
$ bundle exec ruby example/wikipedia-kyoto-japanese-english.rb
...
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
...

…it during XML parsing

Using `Datasets::WikipediaKyotoJapaneseEnglish#each` raised an `entity expansion has grown too large (RuntimeError)`.
This error occurs because the entity expansion limit in REXML is set by ruby/rexml#187,
and `Datasets::WikipediaKyotoJapaneseEnglish#each` exceeds that limit.

In Red Datasets, increasing the entity expansion limit is not a problem because we want to handle large datasets.
Therefore, we temporarily increase the limit.

How to reproduce:

```console
$ cd red-datasets && bundle
$ bundle exec ruby example/wikipedia-kyoto-japanese-english.rb
...
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
...
```
lib/datasets/wikipedia-kyoto-japanese-english.rb Outdated Show resolved Hide resolved
Comment on lines +122 to +128
def with_increased_entity_expansion_text_limit
default_limit = REXML::Security.entity_expansion_text_limit
REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
yield
ensure
REXML::Security.entity_expansion_text_limit = default_limit
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. We can use this for now but it seems that REXML::Parsers::StreamParser#entity_expansion_text_limit= should exist for local limitation change. Could you propose it to ruby/rexml?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds pretty nice. Sure!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've proposed it at ruby/rexml#192.

@otegami otegami marked this pull request as ready for review August 4, 2024 13:13
@otegami
Copy link
Member Author

otegami commented Aug 4, 2024

I will fix the test in other PRs.

@kou kou merged commit a76b917 into red-data-tools:master Aug 5, 2024
0 of 12 checks passed
@kou
Copy link
Member

kou commented Aug 5, 2024

We have the same problem in Datasets::Wikipedia. Could you also fix it too?

@otegami otegami deleted the increase-rexml-entity-expansion-limit branch August 5, 2024 02:14
@otegami
Copy link
Member Author

otegami commented Aug 5, 2024

We have the same problem in Datasets::Wikipedia. Could you also fix it too?

Sure I will fix it!

@kou
Copy link
Member

kou commented Aug 23, 2024

Can we revert this because the latest REXML fixed this problem?

otegami added a commit to otegami/red-datasets that referenced this pull request Aug 24, 2024
…sion limit during XML parsing (red-data-tools#198)"

This reverts commit a76b917.

REXML has fixed the bug where `REXML::Security.entity_expansion_text_limit`
incorrectly calculated text size in both SAX and pull parsers.
Therefore, we no longer need to handle this issue within Red Datasets.

ref: https://github.com/ruby/rexml/releases/tag/v3.3.5
kou pushed a commit that referenced this pull request Aug 24, 2024
…sion limit during XML parsing (#198) (#201)

This reverts commit a76b917.

REXML has fixed the bug where
`REXML::Security.entity_expansion_text_limit` incorrectly calculated
text size in both SAX and pull parsers. Therefore, we no longer need to
handle this issue within Red Datasets.

ref: https://github.com/ruby/rexml/releases/tag/v3.3.5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants