Minimum edit distance and shared prefix length make some potentially problematic assumptions #18
Comments
As a work-around for avoiding skewing of alphabetically sorted lists, a user could specify Some variation of an
"desirability" is going to make everything more complicated. Do you feel that it is worth it? I do see value in making more familiar words more likely to be kept, but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists. So if you are aiming for 7776, just use an input of the 10,000 most common words without actually rating desirability among them.
I think I can keep it relatively uncomplicated by using the order of the inputted word list as a proxy for desirability. This would help me avoid using something like In practice, I'd add a
Yeah, I'm coming around to this view... (And Tidy has the I made Tidy while working with large word lists that were sorted by frequency, whether from Google Books or Wikipedia. These lists are long, and toward the end we get strange words like "aude" and "paniculate". I think I got a little caught up with the idea of automating everything, such that I wouldn't even have to arbitrarily cut the input list down before proceeding. But in my experience actually making word lists, there are plenty of "human"/"arbitrary" choices that need to be made to make a good one.
Both --minimum-edit-distance and --shared-prefix-length always prefer shorter words when choosing between 2 words.

When I was implementing these two features, I figured that was a fine assumption. But now that I think about it, if the inputted word list is sorted by desirability, this desirability information is thrown away when Tidy carries out either --minimum-edit-distance or --shared-prefix-length.

I could re-write these two functions to prefer the word that is higher on the input list, BUT for an alphabetically sorted input list, this output might be weirdly skewed toward the front of the alphabet.
This could be a reason to implement a --is-sorted boolean flag, so users could tell Tidy whether the inputted list is sorted by some desirable metric.
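
To make the trade-off concrete, here is a rough Python sketch of how a tie-breaking rule could change the result of a minimum-edit-distance filter. It is not Tidy's actual implementation: the greedy keep-or-drop strategy, the function names, and the is_sorted parameter (standing in for the proposed --is-sorted flag) are all assumptions made for illustration.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def enforce_min_edit_distance(words: List[str], minimum: int,
                              is_sorted: bool) -> List[str]:
    """Greedy filter (an assumed strategy, not Tidy's own code).

    Candidates are visited in order of preference, and a word is kept
    only if it is at least `minimum` edits away from every word kept
    so far.

    is_sorted=True  -> prefer words that appear earlier in the input.
    is_sorted=False -> prefer shorter words (the behavior described in
                       this issue), using input position only to break
                       ties between words of equal length.
    """
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    rank = {w: i for i, w in enumerate(words)}
    if is_sorted:
        preference = lambda w: rank[w]
    else:
        preference = lambda w: (len(w), rank[w])

    kept: List[str] = []
    for w in sorted(words, key=preference):
        if all(edit_distance(w, k) >= minimum for k in kept):
            kept.append(w)
    # Report survivors in their original input order.
    return sorted(kept, key=lambda w: rank[w])


if __name__ == "__main__":
    # Hypothetical frequency-sorted input: "then" is ranked above "the".
    sample = ["then", "the", "apple"]
    print(enforce_min_edit_distance(sample, 2, is_sorted=True))
    print(enforce_min_edit_distance(sample, 2, is_sorted=False))
```

With the hypothetical input above, "then" and "the" are one edit apart, so only one of them can survive a minimum distance of 2. Preferring input order keeps "then", while preferring shorter words keeps "the". For an alphabetically sorted input, the input-order rule would always favor whichever word comes first alphabetically, which is the skew described above.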