Replies: 3 comments
-
Hi @sxilderik, thank you for opening this discussion. The statistical approach of my library, which relies on the relative frequencies of character n-grams, is sometimes misled. Mixed-language texts are the hardest to identify correctly. I'm sure that Google additionally uses dictionaries to identify certain words, which would explain why Google succeeds and Lingua fails here. But it turns out that the summed log probabilities for the different n-gram lengths are pretty close to each other:
For n-grams of length 1 and 5, French is more likely than English, but for n-grams of length 2, 3 and 4, English is more likely than French. The French words in your sentence do not contain any n-grams that are especially typical of French, and there are no accents either. If there were, I'm sure the detector would decide in favor of French. What is your overall experience with my library? Do you have a lot of incorrectly detected sentences?
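To make the idea of "summed log probabilities per n-gram length" concrete, here is a minimal sketch in Java. It is not Lingua's actual code: the class `NgramScore`, its methods, and the tiny frequency tables are hypothetical and invented for illustration, and skipping unseen n-grams is a simplification chosen here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NgramScore {
    /** Returns the overlapping character n-grams of length n in text. */
    public static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    /**
     * Sums the natural logarithm of each n-gram's relative frequency.
     * N-grams missing from the model are skipped (a simplification).
     */
    public static double sumLogProbabilities(List<String> grams, Map<String, Double> model) {
        double sum = 0.0;
        for (String gram : grams) {
            Double p = model.get(gram);
            if (p != null && p > 0.0) {
                sum += Math.log(p);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Invented toy frequencies, NOT taken from any real language model.
        Map<String, Double> toyEnglish = Map.of("oo", 0.02, "ft", 0.01, "to", 0.03);
        Map<String, Double> toyFrench = Map.of("oo", 0.001, "ft", 0.002, "to", 0.02);

        List<String> bigrams = ngrams("rooftop", 2);
        System.out.println("English: " + sumLogProbabilities(bigrams, toyEnglish));
        System.out.println("French:  " + sumLogProbabilities(bigrams, toyFrench));
    }
}
```

The language with the higher (less negative) sum is the more likely one for that n-gram length. When the sums for two languages are close, as described above, a handful of n-grams is enough to tip the decision.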
-
Thanks for your reply! But maybe I am mistaken, and this info is already available somewhere I did not look.
-
Well, I found the computeLanguageConfidenceValues API. I use it now, keeping only the languages with a confidence of 90% or more. If one of these languages is French, I keep the sentence for further testing.
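For readers who want to try the same filter, here is a minimal sketch in Java. It does not call Lingua: a plain `Map<String, Double>` stands in for whatever computeLanguageConfidenceValues returns, since the exact return type and value scale depend on the Lingua version. The class `ConfidenceFilter`, its method names, and the sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ConfidenceFilter {
    // Threshold from the comment above: keep languages at 90% confidence or more.
    public static final double THRESHOLD = 0.9;

    /** Returns the languages whose confidence is at least the threshold, in map order. */
    public static List<String> confidentLanguages(Map<String, Double> confidences, double threshold) {
        return confidences.entrySet().stream()
                .filter(entry -> entry.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** True if French is among the languages that pass the threshold. */
    public static boolean isFrenchCandidate(Map<String, Double> confidences, double threshold) {
        return confidentLanguages(confidences, threshold).contains("FRENCH");
    }

    public static void main(String[] args) {
        // Invented sample values standing in for the detector's output.
        Map<String, Double> confidences = new LinkedHashMap<>();
        confidences.put("ENGLISH", 1.0);
        confidences.put("FRENCH", 0.93);
        confidences.put("GERMAN", 0.41);

        System.out.println(confidentLanguages(confidences, THRESHOLD));
        System.out.println(isFrenchCandidate(confidences, THRESHOLD));
    }
}
```

With this approach, a sentence is kept whenever French passes the threshold, even if another language ranks higher. That trades some false positives for fewer missed French sentences.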
-
« Ce connard impose un meeting sur le rooftop asap », which translate.google.com correctly identifies as French, is identified as English by Lingua.
I plan to use speech recognition and apply GDPR to French texts only. Lingua would wrongly ignore that sentence, which is a no-go for me…
NB: this sentence is definitely French; those English words are – alas – in common use in business or office French.