Minimum edit distance and shared prefix length make some potentially problematic assumptions #18

Open
sts10 opened this issue Aug 19, 2022 · 3 comments

Comments

@sts10
Owner

sts10 commented Aug 19, 2022

Both --minimum-edit-distance and --shared-prefix-length always prefer shorter words when choosing between 2 words.

When I was implementing these two features, I figured that was a fine assumption. But now that I think about it, if the inputted word list is sorted by desirability, that desirability information is thrown away when Tidy carries out either --minimum-edit-distance or --shared-prefix-length.

I could re-write these two functions to prefer the word that is higher on the input list, BUT for an alphabetically sorted input list, this output might be weirdly skewed toward the front of the alphabet.

This could be a reason to implement a --is-sorted boolean flag, so users could tell Tidy whether the inputted list is sorted by some desirable metric.

@bwbug

bwbug commented Aug 21, 2022

As a workaround to avoid skewing alphabetically sorted lists, a user could specify --take-rand with an argument equal to the full list size. Or you could automatically randomize the list before running the --minimum-edit-distance and --shared-prefix-length filters, unless the user has specified -O.

Some variation of an --is-sorted boolean may turn out to be a good solution, but I wanted to throw out the above alternatives for consideration as well. At this time, I'm not making a case for one alternative over another.
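A minimal sketch of the "randomize before filtering" alternative, assuming the word list is held as a slice of `String`s. The names `XorShift64` and `shuffle_words` are hypothetical and are not taken from Tidy's code base. The small built-in generator is there only to keep the example free of external crates; real code would more likely use the `rand` crate's slice shuffling.

```rust
/// Tiny xorshift64 generator, used only so this sketch needs no external
/// crates. It is NOT suitable for anything security-sensitive.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle of the word list, in place (hypothetical helper).
/// After this runs, "earlier in the list" no longer correlates with
/// "earlier in the alphabet".
fn shuffle_words(words: &mut [String], seed: u64) {
    // A xorshift generator must not start from an all-zero state.
    let mut rng = XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed });
    for i in (1..words.len()).rev() {
        // Pick j in 0..=i. Modulo bias is ignored here for brevity.
        let j = (rng.next() % (i as u64 + 1)) as usize;
        words.swap(i, j);
    }
}
```

Shuffling first means a filter that prefers "whichever word comes first" makes an effectively random choice, which removes the alphabetical skew but also discards any desirability ordering the list may have had.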

@jpgoldberg

"desirability" is going to make everything more complicated. Do you feel that it is worth it? I do see value in making more familiar words more likely, but assuming the user knows the target size they are trying to achieve, they could truncate their input lists. So if you are aiming for 7776, just use an input of the 10,000 most common words without actually rating desirability among them.
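The truncation idea can be sketched as follows, assuming the input is already sorted from most to least common. `take_most_common` is a hypothetical helper written for illustration and is not part of Tidy.

```rust
/// Hypothetical helper: keep only the first `n` entries of a list that is
/// already sorted from most to least common. If the list has fewer than
/// `n` entries, all of them are kept.
fn take_most_common(words: &[String], n: usize) -> Vec<String> {
    words.iter().take(n).cloned().collect()
}
```

With a frequency-sorted source, calling this with `n = 10_000` before any filtering leaves a pool of roughly equally acceptable words, so the later filters can choose between them arbitrarily without losing much.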

@sts10
Owner Author

sts10 commented Nov 14, 2022

> "desirability" is going to make everything more complicated.

I think I can keep it relatively uncomplicated by using the order of the inputted word list as a proxy for desirability. This would help me avoid using something like struct Word { s: String, desirability: u32 } throughout the code base.

In practice, I'd add a --is-sorted boolean flag. If the user set it to true, then whenever Tidy executes a filter that requires an arbitrary choice between two words (I think minimum edit distance and shared prefix length are the only two so far, hence this issue), it would prefer whichever word came first in the given input order. Otherwise, it could continue preferring the shorter word, as a loose stand-in for desirability.
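A minimal sketch of that choice, assuming each word is passed along with its position in the original input list. `pick_word` and its signature are hypothetical and do not reflect Tidy's actual implementation; the tie-break on equal length is also an assumption made here so the result is deterministic.

```rust
/// Hypothetical helper, not Tidy's actual code: decide which of two
/// conflicting words to keep. Each argument is `(index, word)`, where
/// `index` is the word's position in the original input list.
fn pick_word<'a>(a: (usize, &'a str), b: (usize, &'a str), is_sorted: bool) -> &'a str {
    if is_sorted {
        // Input order is a proxy for desirability: the earlier word wins.
        if a.0 <= b.0 { a.1 } else { b.1 }
    } else {
        // Otherwise prefer the shorter word. On equal length, keep the one
        // that came first (an assumption, chosen for determinism).
        let (len_a, len_b) = (a.1.chars().count(), b.1.chars().count());
        if len_a < len_b || (len_a == len_b && a.0 <= b.0) {
            a.1
        } else {
            b.1
        }
    }
}
```

For example, if "house" sits at position 10 of a frequency-sorted list and "hose" at position 500, `pick_word((10, "house"), (500, "hose"), true)` keeps "house", while passing `false` keeps the shorter "hose". Carrying only an index avoids threading a dedicated `Word` struct through the code base.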

> but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists.

Yeah, I'm coming around to this view... (And Tidy has the --take-first and --whittle-to options to make this truncation process easier.)

I made Tidy while working with large word lists that were sorted by frequency, whether from Google Books or Wikipedia. These lists are long and toward the end we get strange words like "aude" and "paniculate". I think I got a little caught up with the idea of automating everything, such that I wouldn't even have to arbitrarily cut the input list down before proceeding. But in my experience actually making word lists, there are plenty of "human"/"arbitrary" choices that need to be made to make a good one.

sts10 added and then removed the "question" label (Further information is requested) on Aug 25, 2023
3 participants