Minimum edit distance and shared prefix length make some potentially problematic assumptions #18

sts10 · 2022-08-19T21:20:27Z

Both --minimum-edit-distance and --shared-prefix-length always prefer shorter words when choosing between 2 words.

When I was implementing these two features, I figured that was a fine assumption. But now that I think about it, if the inputted word list is sorted by desirability, this desirability information is thrown away when Tidy carries out either --minimum-edit-distance or --shared-prefix-length.

I could re-write these two functions to prefer the word that is higher on the input list, BUT for an alphabetically sorted input list, this output might be weirdly skewed toward the front of the alphabet.

This could be a reason to implement a --is-sorted boolean flag, so users could tell Tidy whether the inputted list is sorted by some desirable metric.


bwbug · 2022-08-21T03:18:49Z

As a work-around for avoiding skewing of alphabetically sorted lists, a user could specify --take-rand with an argument equal to the full list size. Or you could automatically randomize the list before running the --minimum-edit-distance and --shared-prefix-length, unless the user has specified -O.

Some variation of an --is-sorted boolean may turn out to be a good solution, but I wanted to throw out the above alternatives for consideration as well. At this time, I'm not making a case for one alternative over another.

jpgoldberg · 2022-11-14T06:02:53Z

"desirability" is going to make everything more complicated. Do you feel that it is worth it? I do see value in having more familiar words more likely, but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists. So if you are aiming for 7776, just use an input of the 10,000 most common words without actually rating desirability among them.

sts10 · 2022-11-14T12:47:54Z

"desirability" is going to make everything more complicated.

I think I can keep it relatively uncomplicated by using the order of the inputted word list as a proxy for desirability. This would help me avoid using something like struct Word { s: String, desirability: u32 } throughout the code base.

In practice, I'd add a --is-sorted boolean flag. If that was set to true by the user, whenever Tidy executes a filter that requires an arbitrary choice between two words (I think minimum edit distance and shared prefix length are the only two so far, hence this issue), it would prefer whichever word was first in the given input order. Otherwise, it could continue preferring the shorter word, as a loose stand-in for desirability.
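To make the proposal above concrete, here is a minimal sketch of how the tie-break could work for a shared-prefix filter. It is not Tidy's actual code: the names `TieBreak`, `word_to_keep`, and `unique_prefixes` are hypothetical, `is_sorted` stands in for the proposed (not yet existing) --is-sorted flag, and the sketch assumes that position in the input list is the only desirability signal available.

```rust
use std::collections::HashMap;

/// Hypothetical tie-break rule for two words that conflict under a filter.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TieBreak {
    /// Keep the shorter word (the behavior described at the top of this issue).
    PreferShorter,
    /// Keep the word that appears earlier in the input list.
    PreferEarlier,
}

/// Given two conflicting (input position, word) pairs, return the pair to keep.
/// Under PreferShorter, words of equal length fall back to input order.
pub fn word_to_keep<'a>(
    a: (usize, &'a str),
    b: (usize, &'a str),
    tie_break: TieBreak,
) -> (usize, &'a str) {
    let a_is_earlier = a.0 <= b.0;
    let keep_a = match tie_break {
        TieBreak::PreferEarlier => a_is_earlier,
        TieBreak::PreferShorter => {
            // Count chars rather than bytes so multi-byte characters count once.
            let (len_a, len_b) = (a.1.chars().count(), b.1.chars().count());
            len_a < len_b || (len_a == len_b && a_is_earlier)
        }
    };
    if keep_a { a } else { b }
}

/// Keep at most one word per leading `prefix_len` characters.
/// `is_sorted` is an assumption modeling the proposed --is-sorted flag.
pub fn unique_prefixes<'a>(words: &[&'a str], prefix_len: usize, is_sorted: bool) -> Vec<&'a str> {
    let tie_break = if is_sorted {
        TieBreak::PreferEarlier
    } else {
        TieBreak::PreferShorter
    };
    // Map each prefix to the best (input position, word) seen so far.
    let mut kept: HashMap<String, (usize, &'a str)> = HashMap::new();
    for (i, &word) in words.iter().enumerate() {
        let prefix: String = word.chars().take(prefix_len).collect();
        let candidate = (i, word);
        let winner = match kept.get(&prefix) {
            Some(&current) => word_to_keep(current, candidate, tie_break),
            None => candidate,
        };
        kept.insert(prefix, winner);
    }
    // Restore the original input order among the survivors.
    let mut survivors: Vec<(usize, &'a str)> = kept.into_values().collect();
    survivors.sort_by_key(|&(i, _)| i);
    survivors.into_iter().map(|(_, w)| w).collect()
}
```

With the input `["station", "stat", "apple", "applesauce"]` and a prefix length of 4, passing `is_sorted = true` keeps "station" and "apple", while `is_sorted = false` keeps "stat" and "apple". Carrying the input position alongside each word only inside the filter is one way to avoid threading a desirability field through the rest of the code base.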

but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists.

Yeah, I'm coming around to this view... (And Tidy has the --take-first and --whittle-to options to make this truncation process easier.)

I made Tidy while working with large word lists that were sorted by frequency, whether from Google Books or Wikipedia. These lists are long and toward the end we get strange words like "aude" and "paniculate". I think I got a little caught up with the idea of automating everything, such that I wouldn't even have to arbitrarily cut the input list down before proceeding. But in my experience actually making word lists, there are plenty of "human"/"arbitrary" choices that need to be made to make a good one.

sts10 added and then removed the "question" label · 2023-08-25
