Opened 3 years ago

Closed 16 months ago

Last modified 16 months ago

#263 closed enhancement (fixed)

On import, detect when a string has minor changes and keep the original translations.

Reported by: yoavf Owned by: ocean90
Milestone: 1.0 Priority: major
Version: 0.1 Component: General
Keywords: has-patch Cc:

Description

If a string has a minor change before an import (say, a dot added to the end of a sentence), all existing translations will be useless, since the original will be marked as obsolete.

GlotPress should detect minor changes to originals on import and mark the existing translations as fuzzy instead.

Attachments (11)

fuzzy_similar_translation.diff (4.6 KB) - added by yoavf 3 years ago.
263.diff (5.8 KB) - added by yoavf 3 years ago.
263.2.diff (7.4 KB) - added by yoavf 3 years ago.
263.3.diff (12.7 KB) - added by yoavf 2 years ago.
263.3.2.diff (11.7 KB) - added by yoavf 2 years ago.
previous file had some duplication
263.4.diff (13.2 KB) - added by yoavf 2 years ago.
263.5.diff (15.3 KB) - added by yoavf 2 years ago.
263.6.diff (15.4 KB) - added by yoavf 2 years ago.
Add an action hook on string similarity
263.7.diff (10.8 KB) - added by yoavf 2 years ago.
Core changes for this ticket, after helpers have been committed.
263.8.diff (10.5 KB) - added by yoavf 2 years ago.
refresh, fix a couple of minor bugs in closest_original()
263.9.diff (10.7 KB) - added by yoavf 16 months ago.

Change History (44)

#1 @yoavf
3 years ago

fuzzy_similar_translation.diff uses a Levenshtein-based comparison to go over new originals and compare them with previously existing originals. If a new string is at least 75% similar to a previous string that is now missing, it replaces that string instead of being added as a new original, and the existing translations are marked as fuzzy.

TODO:

  • Use PHP's similar_text() for strings longer than 255 chars (levenshtein()'s practical limit).
  • Make the 75% default filterable.

Comments welcome :)

Last edited 3 years ago by yoavf
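
Editor's sketch of the comparison described in comment 1, written in Python rather than GlotPress's PHP so it can be run standalone. The function names, the normalisation (one minus edit distance divided by the longer length) and the `MIN_SIMILARITY` constant are assumptions for illustration; the patch itself may compute the percentage differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: inserts, deletes and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


MIN_SIMILARITY = 0.75  # hypothetical constant for the 75% default


def is_minor_change(old: str, new: str) -> bool:
    """True when `new` is close enough to `old` to keep translations as fuzzy."""
    return similarity(old, new) >= MIN_SIMILARITY
```

With this normalisation, adding a trailing dot to "Hello world" is one edit over twelve characters, which scores about 0.92 and counts as a minor change.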

#2 follow-up: @markoheijnen
3 years ago

Awesome, I still was planning to look into this. I will close #93 then as duplicated.

#3 follow-up: @nacin
3 years ago

Both levenshtein() and similar_text() are very slow. What kind of an effect might this have on speed of imports? WordPress.org can handle it, but it would be good to know what the slow-down will look like.

If a string is less than, say, 15 or 20 characters, is fuzziness worth it? I imagine that the 75% should be a sliding scale. A 75% (or even 80%) match of 20 characters would probably result in false positives, while 75% (or less) might be OK when dealing with particularly long strings.
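
Editor's sketch of what a sliding scale could look like, in Python for illustration. It assumes similarity is measured as one minus edits divided by length; the cut-off lengths (20 and 200) and the strict/lenient thresholds (0.9 and 0.75) are hypothetical values, not taken from any patch on this ticket.

```python
import math


def min_similarity(length: int, short: int = 20, long: int = 200,
                   strict: float = 0.9, lenient: float = 0.75) -> float:
    """Relax the threshold linearly from `strict` to `lenient` as strings grow."""
    if length <= short:
        return strict
    if length >= long:
        return lenient
    fraction = (length - short) / (long - short)
    return strict - fraction * (strict - lenient)


def allowed_edits(length: int, threshold: float) -> int:
    """Number of single-character edits that still pass at this threshold."""
    # The small epsilon guards against float error such as (1 - 0.9) * 20.
    return math.floor((1 - threshold) * length + 1e-9)
```

Under these assumptions a 20-character string tolerates 5 edits at 75% but only 2 at 90%, which is the false-positive risk the comment describes for short strings.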

#4 follow-up: @nacin
3 years ago

It would be good if this was entirely driven by filters. So here's string A, here's string B, please return true/false as to whether the string is fuzzy. Then a function could hook in and decide if the strings are close enough. (Inside this callback could be a filter controlling the 75% threshold.)

One benefit of a hook is that on WordPress.org I would want to log all 50%+ matches over the course of a number of imports, to see if the threshold is appropriate, or if it is flagging a lot of false positives (the problem here being where a string's meaning changes but translators don't pick up on it).
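
Editor's sketch of the filter-driven decision described in comment 4, in Python for illustration. The registry, the callback signature and every name here are hypothetical; GlotPress would use its own PHP filter API. The second half shows the logging use case: a callback that records every match of 50% or more without changing the verdict.

```python
from typing import Callable, List, Tuple

# A callback receives the current verdict plus both strings and their score,
# and returns the (possibly changed) verdict.
Callback = Callable[[bool, str, str, float], bool]

_filters: List[Callback] = []


def add_similarity_filter(callback: Callback) -> Callback:
    """Register a callback; usable as a decorator."""
    _filters.append(callback)
    return callback


def is_string_similar(old: str, new: str, score: float,
                      default_threshold: float = 0.75) -> bool:
    """Default verdict from the threshold, then let every callback adjust it."""
    verdict = score >= default_threshold
    for callback in _filters:
        verdict = callback(verdict, old, new, score)
    return verdict


near_misses: List[Tuple[float, str, str]] = []


@add_similarity_filter
def log_candidates(verdict: bool, old: str, new: str, score: float) -> bool:
    """Record all 50%+ matches for later review; leave the verdict untouched."""
    if score >= 0.5:
        near_misses.append((score, old, new))
    return verdict
```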

#5 in reply to: ↑ 2 @yoavf
3 years ago

Replying to markoheijnen:

Awesome, I still was planning to look into this. I will close #93 then as duplicated.

Thanks, not sure how I missed #93 :)

#6 in reply to: ↑ 3 @yoavf
3 years ago

Replying to nacin:

Both levenshtein() and similar_text() are very slow. What kind of an effect might this have on speed of imports? WordPress.org can handle it, but it would be good to know what the slow-down will look like.

I ran a few local tests.

Importing 2k strings into a 20k-string database: about 2.5 seconds added with this patch
Importing 20k strings into a 2k-string database: about 4.5 seconds added with this patch

Of course, times get higher the fewer matches you have between the old and new databases, but I don't think time is too much of an issue here.

I plan to test this on WP.com soon - I'll report the results here. (We do a full import every hour or so, and our main .pot file holds about 20k strings).

#7 in reply to: ↑ 4 @yoavf
3 years ago

Replying to nacin:

If a string is less than, say, 15 or 20 characters, is fuzziness worth it? I imagine that the 75% should be a sliding scale. A 75% (or even 80%) match of 20 characters would probably result in false positives, while 75% (or less) might be OK when dealing with particularly long strings.

I agree, I'll see about making this variable based on length.

It would be good if this was entirely driven by filters. So here's string A, here's string B, please return true/false as to whether the string is fuzzy. Then a function could hook in and decide if the strings are close enough. (Inside this callback could be a filter controlling the 75% threshold.)

One benefit of a hook is that on WordPress.org I would want to log all 50%+ matches over the course of a number of imports, to see if the threshold is appropriate, or if it is flagging a lot of false positives (the problem here being where a string's meaning changes but translators don't pick up on it).

I'll work on that.

@yoavf
3 years ago

#8 @yoavf
3 years ago

263.diff breaks things up a bit (in a somewhat complex way):

  • closest_original() compares string lengths before doing a textual comparison (filterable)
  • string_similarity() handles the actual comparison per word (using levenshtein() / similar_text() depending on length)
  • the best matching string is then passed, along with its similarity percentage, to gp_original_is_string_similar
  • is_string_similar() makes that decision by default

Another improvement: instead of comparing to all the originals for every string, we only compare to the originals we haven't yet matched (unset( $originals_for_comparison[$entry->key()] );)

#9 follow-ups: @yoavf
3 years ago

Just realized it will be much more effective to only compare dropped strings with new strings, and get our fuzzies from there. I'll work on that sometime next week.

#10 @markoheijnen
3 years ago

I think putting this in a separate class would make sense. And yes, it's more effective to only compare dropped strings. I guess the import code as a whole needs a rewrite.

#11 in reply to: ↑ 9 @nacin
3 years ago

Replying to yoavf:

Just realized it will be much more effective to only compare dropped strings with new strings, and get our fuzzies from there. I'll work on that sometime next week.

Not just dropped strings — probably also any other obsolete strings.

#12 @markoheijnen
3 years ago

  • Milestone changed from Future Release to 1.0

#13 in reply to: ↑ 9 @yoavf
3 years ago

Replying to yoavf:

I'll work on that sometime next week.

That took a while :)

263.2.diff changes things a bit:

  • It now collects unmatched strings (no exact originals) into the possibly_added array.
  • It puts all unused strings in the possibly_dropped array.
  • It creates a comparison array which combines the possibly_dropped array with existing obsolete originals.
  • It loops over all the possibly_added strings and tries to find similar strings in the comparison array.
  • If it finds a match, it marks that original as active and its existing translations as fuzzy.
  • No match? It's a new original that gets added.
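
Editor's sketch of the steps listed above, in Python for illustration. The function name `reconcile`, its return shape and the use of plain strings for originals are assumptions. The similarity score here comes from the standard library's `difflib.SequenceMatcher` as a stand-in for the patch's Levenshtein-based comparison, and the 0.75 default mirrors the threshold from comment 1.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; the patch uses a Levenshtein-based measure."""
    return SequenceMatcher(None, a, b).ratio()


def reconcile(imported: List[str], active: Set[str], obsolete: Set[str],
              min_similarity: float = 0.75
              ) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (fuzzy, added, dropped) for one import run.

    fuzzy maps each new string to the old original it replaces.
    """
    # Imported strings with no exact active original.
    possibly_added = [s for s in imported if s not in active]
    # Active originals that the import no longer contains.
    possibly_dropped = active - set(imported)
    # Candidates to match against: dropped strings plus existing obsolete ones.
    comparison = possibly_dropped | obsolete

    fuzzy: Dict[str, str] = {}
    added: List[str] = []
    for new in possibly_added:
        best, best_score = None, 0.0
        for old in sorted(comparison):  # sorted for deterministic tie-breaking
            score = similarity(old, new)
            if score > best_score:
                best, best_score = old, score
        if best is not None and best_score >= min_similarity:
            fuzzy[new] = best         # reactivate; translations become fuzzy
            comparison.discard(best)  # each old original is matched only once
        else:
            added.append(new)         # genuinely new original

    dropped = sorted(possibly_dropped - set(fuzzy.values()))
    return fuzzy, added, dropped
```

Comparing only unmatched new strings against dropped and obsolete originals keeps the number of similarity calls small when most strings match exactly, which is the case the earlier timing tests describe.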

@yoavf
3 years ago

#14 @yoavf
3 years ago

Coming back to this, I'd love some feedback. This has been running fine on WP.com for a few months now.

One TODO I already have is to add a count of fuzzy strings per language on the project page (already running on WP.com too, just need to clean it up).

#15 @markoheijnen
3 years ago

  • Keywords gsoc added

@yoavf
2 years ago

@yoavf
2 years ago

previous file had some duplication

#17 @yoavf
2 years ago

In 263.3.2.diff I cleaned up the code a bit and simplified the comparison. There's now a hard limit of 5000 chars on string comparison, with a custom levenshtein() function and no reliance on the slower similar_text().

Also updated the UI to display the fuzzy count on the project view, and in the import result strings.

#18 @yoavf
2 years ago

  • Keywords 2nd-opinion added; gsoc removed

The comparison functions should probably be outside of the 'original' class. Any suggestions on where to put them?

#19 follow-up: @markoheijnen
2 years ago

Create a new class/file for all the comparison code in gp-includes?

Last edited 2 years ago by markoheijnen

@yoavf
2 years ago

#20 in reply to: ↑ 19 @yoavf
2 years ago

Replying to markoheijnen:

Create a new class/file for all the comparison code in gp-includes?

Thanks, I think we can put them in strings.php, which I did with 263.4.diff - and also added a simple test for the similarity function. I also switched to using gp_* functions instead of mb_* ones.

@yoavf
2 years ago

#21 @yoavf
2 years ago

  • Keywords 2nd-opinion removed

263.5.diff adds a test for the fuzzy on import functionality.

#22 @yoavf
2 years ago

hrm, the test I added works when running this single test file, but not when running the full test suite. Investigating.

Edit: To clarify, the problem is with the following call:

$translations = GP::$translation->find_many( "original_id = '$original_id' AND status = 'current'" );

When running the t/tests/tests_things/test_thing_original.php file directly, it returns an array of translation objects.
When running the whole test suite, it returns an array of stdClass objects.

Last edited 2 years ago by yoavf

#23 @yoavf
2 years ago

With #298 fixed, tests are now ok too.

#24 @yoavf
2 years ago

In 877:

translation sets: introduce the fuzzy_count method and non-db property so we can easily access it when needed. See #263

@yoavf
2 years ago

Add an action hook on string similarity

#25 @yoavf
2 years ago

In 881:

String functions: introduce a couple of functions to measure string similarity.

gp_levenshtein() lets us compare strings longer than 255 bytes (PHP's native levenshtein() limit).
gp_string_similarity() is a wrapper around it that returns a score between 0 (no similarity) and 1 (same string), and makes sure we're not running comparisons on strings longer than 5000 chars (an arbitrary limit for performance purposes).
See #263

#26 @yoavf
2 years ago

In 882:

String functions: unit tests, follow-up to r881. See #263

#27 @yoavf
2 years ago

In 883:

Project view: show fuzzy count on translated sets, see #263

@yoavf
2 years ago

Core changes for this ticket, after helpers have been committed.

@yoavf
2 years ago

refresh, fix a couple of minor bugs in closest_original()

This ticket was mentioned in Slack in #polyglots by markoheijnen. View the logs.


21 months ago

This ticket was mentioned in Slack in #glotpress by ocean90. View the logs.


17 months ago

@yoavf
16 months ago

#30 @yoavf
16 months ago

263.9.diff is refreshed, fixes a copy-pasted filter name, and raises the minimum similarity score to 0.8, which we've found to work better on WP.com.

This ticket was mentioned in Slack in #glotpress by yoavf. View the logs.


16 months ago

#32 @ocean90
16 months ago

  • Owner set to ocean90
  • Resolution set to fixed
  • Status changed from new to closed

In 1032:

On import, detect when a string has minor changes and keep the original translations. Existing translations will be marked as fuzzy.

props yoavf.
fixes #263.

This ticket was mentioned in Slack in #glotpress by ocean90. View the logs.


16 months ago