WordPress.org

GlotPress

Opened 20 months ago

Last modified 4 weeks ago

#263 new enhancement

On import, detect when a string has minor changes and keep the original translations.

Reported by: yoavf
Owned by:
Milestone: 1.0
Priority: major
Version: 0.1
Component: General
Keywords: has-patch
Cc:

Description

If a string has a minor change before an import (say, a dot added to the end of a sentence), all existing translations will be useless, since the original will be marked as obsolete.

GlotPress should detect minor changes to an original on import and mark the existing translations as fuzzy instead of dropping them.

Attachments (10)

fuzzy_similar_translation.diff (4.6 KB) - added by yoavf 20 months ago.
263.diff (5.8 KB) - added by yoavf 20 months ago.
263.2.diff (7.4 KB) - added by yoavf 17 months ago.
263.3.diff (12.7 KB) - added by yoavf 12 months ago.
263.3.2.diff (11.7 KB) - added by yoavf 12 months ago.
previous file had some duplication
263.4.diff (13.2 KB) - added by yoavf 12 months ago.
263.5.diff (15.3 KB) - added by yoavf 12 months ago.
263.6.diff (15.4 KB) - added by yoavf 12 months ago.
Add an action hook on string similarity
263.7.diff (10.8 KB) - added by yoavf 12 months ago.
Core changes for this ticket, after helpers have been committed.
263.8.diff (10.5 KB) - added by yoavf 11 months ago.
refresh, fix a couple of minor bugs in closest_original()

Change History (39)

comment:1 @yoavf 20 months ago

fuzzy_similar_translation.diff uses a Levenshtein-based comparison to go over new originals and compare them with previously existing originals. If a new string is at least 75% similar to a previous string that is now missing, it replaces that string instead of being added as a new original, and the existing translations are marked as fuzzy.

TODO:

  • Use PHP's similar_text() for strings longer than 255 chars (levenshtein()'s practical limit).
  • Make the 75% default filterable.
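
A minimal sketch of the comparison described above, assuming a normalized score and the 75% default. The function names sketch_similarity() and sketch_is_minor_change() are hypothetical and are not taken from the patch; the 255-byte switch to similar_text() follows the TODO.

```php
<?php
// Hypothetical helper, not the code in fuzzy_similar_translation.diff.
// Returns a score between 0.0 (nothing in common) and 1.0 (identical).
function sketch_similarity( string $old, string $new ): float {
    if ( $old === $new ) {
        return 1.0;
    }
    $longest = max( strlen( $old ), strlen( $new ) );
    if ( $longest <= 255 ) {
        // levenshtein() counts byte-level edits; fewer edits means more similar.
        return 1.0 - levenshtein( $old, $new ) / $longest;
    }
    // Fallback for long strings: similar_text() reports a percentage by reference.
    similar_text( $old, $new, $percent );
    return $percent / 100;
}

// Assumed default threshold, taken from the 75% mentioned in the comment.
function sketch_is_minor_change( string $old, string $new, float $threshold = 0.75 ): bool {
    return sketch_similarity( $old, $new ) >= $threshold;
}

var_dump( sketch_is_minor_change( 'Save your changes', 'Save your changes.' ) ); // bool(true)
var_dump( sketch_is_minor_change( 'Save your changes', 'Delete this post' ) );   // bool(false)
```

Passing the threshold as a parameter is one simple way to keep the 75% default adjustable until a proper filter exists.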

Comments welcome :)

Last edited 20 months ago by yoavf

comment:2 follow-up: @markoheijnen 20 months ago

Awesome, I still was planning to look into this. I will close #93 then as duplicated.

comment:3 follow-up: @nacin 20 months ago

Both levenshtein() and similar_text() are very slow. What kind of an effect might this have on speed of imports? WordPress.org can handle it, but it would be good to know what the slow-down will look like.

If a string is less than, say, 15 or 20 characters, is fuzziness worth it? I imagine that the 75% should be a sliding scale. A 75% (or even 80%) match of 20 characters would probably result in false positives, while 75% (or less) might be OK when dealing with particularly long strings.
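
The sliding scale suggested here could look roughly like the sketch below. The cut-off lengths and required scores are illustrative assumptions only, and both function names are hypothetical.

```php
<?php
// Hypothetical sliding scale. The cut-off lengths and scores are assumptions
// chosen for illustration; real values would come from logged import data.
function sketch_required_similarity( int $length ): ?float {
    if ( $length < 20 ) {
        return null; // Too short: skip fuzzy matching entirely.
    }
    if ( $length < 100 ) {
        return 0.85; // Medium strings need a closer match.
    }
    return 0.75; // Long strings can tolerate more edits.
}

function sketch_passes( float $score, int $length ): bool {
    $required = sketch_required_similarity( $length );
    return null !== $required && $score >= $required;
}

var_dump( sketch_passes( 0.80, 12 ) );  // bool(false): string too short
var_dump( sketch_passes( 0.80, 60 ) );  // bool(false): below 0.85
var_dump( sketch_passes( 0.80, 400 ) ); // bool(true)
```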

comment:4 follow-up: @nacin 20 months ago

It would be good if this was entirely driven by filters. So here's string A, here's string B, please return true/false as to whether the string is fuzzy. Then a function could hook in and decide if the strings are close enough. (Inside this callback could be a filter controlling the 75% threshold.)

One benefit of a hook is that on WordPress.org I would want to log all 50%+ matches over the course of a number of imports, to see if the threshold is appropriate, or if it is flagging a lot of false positives (the problem here being where a string's meaning changes but translators don't pick up on it).
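
To make the filter-driven idea concrete, here is a rough sketch. Sketch_Hooks is a tiny stand-in written for this example and is not GlotPress's hook API; the filter names sketch_is_fuzzy and sketch_fuzzy_threshold are likewise hypothetical. The caller is assumed to have computed the similarity score already.

```php
<?php
// Minimal stand-in for a filter system, written only for this sketch.
final class Sketch_Hooks {
    private static $callbacks = array();

    public static function add( string $tag, callable $callback ): void {
        self::$callbacks[ $tag ][] = $callback;
    }

    public static function apply( string $tag, $value, ...$args ) {
        foreach ( self::$callbacks[ $tag ] ?? array() as $callback ) {
            $value = $callback( $value, ...$args );
        }
        return $value;
    }
}

final class Sketch_Log {
    public static $rows = array();
}

// Default decision: string A, string B and a score go in, true/false comes out.
// The threshold itself passes through a second filter.
Sketch_Hooks::add( 'sketch_is_fuzzy', function ( bool $is_fuzzy, string $old, string $new, float $score ): bool {
    $threshold = Sketch_Hooks::apply( 'sketch_fuzzy_threshold', 0.75 );
    return $score >= $threshold;
} );

// Logging callback: records every match of 50% or more, never changes the decision.
Sketch_Hooks::add( 'sketch_is_fuzzy', function ( bool $is_fuzzy, string $old, string $new, float $score ): bool {
    if ( $score >= 0.5 ) {
        Sketch_Log::$rows[] = array( $old, $new, $score, $is_fuzzy );
    }
    return $is_fuzzy;
} );

$decision = Sketch_Hooks::apply( 'sketch_is_fuzzy', false, 'Hello world', 'Hello, world!', 0.85 );
var_dump( $decision ); // bool(true)
```

Because the logger runs after the default callback, each logged row also records the decision that was made, which is the data needed to judge whether the threshold produces false positives.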

comment:5 in reply to: ↑ 2 @yoavf 20 months ago

Replying to markoheijnen:

Awesome, I still was planning to look into this. I will close #93 then as duplicated.

Thanks, not sure how I missed #93 :)

comment:6 in reply to: ↑ 3 @yoavf 20 months ago

Replying to nacin:

Both levenshtein() and similar_text() are very slow. What kind of an effect might this have on speed of imports? WordPress.org can handle it, but it would be good to know what the slow-down will look like.

I ran a few local tests.

Importing 2k strings into a 20k-string database: about 2.5 seconds added with this patch
Importing 20k strings into a 2k-string database: about 4.5 seconds added with this patch

Of course, times get higher the fewer matches you have between the old and new databases, but I don't think time is too much of an issue here.

I plan to test this on WP.com soon - I'll report the results here. (We do a full import every hour or so, and our main .pot file holds about 20k strings).

comment:7 in reply to: ↑ 4 @yoavf 20 months ago

Replying to nacin:

If a string is less than, say, 15 or 20 characters, is fuzziness worth it? I imagine that the 75% should be a sliding scale. A 75% (or even 80%) match of 20 characters would probably result in false positives, while 75% (or less) might be OK when dealing with particularly long strings.

I agree, I'll see about making this variable based on length.

It would be good if this was entirely driven by filters. So here's string A, here's string B, please return true/false as to whether the string is fuzzy. Then a function could hook in and decide if the strings are close enough. (Inside this callback could be a filter controlling the 75% threshold.)

One benefit of a hook is that on WordPress.org I would want to log all 50%+ matches over the course of a number of imports, to see if the threshold is appropriate, or if it is flagging a lot of false positives (the problem here being where a string's meaning changes but translators don't pick up on it).

I'll work on that.

@yoavf 20 months ago

comment:8 @yoavf 20 months ago

263.diff breaks things up a bit (in a somewhat complex way):

  • closest_original() compares string lengths before doing a textual comparison (filterable)
  • string_similarity() handles the actual comparison per word (using levenshtein() / similar_text() depending on length)
  • the best matching string is then passed, with its similarity percentage, to gp_original_is_string_similar
  • is_string_similar() handles that decision by default

Another improvement: instead of comparing to all the originals for every string, we only compare to the originals we haven't yet matched (unset( $originals_for_comparison[$entry->key()] );).
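
As a rough illustration of the length pre-check and the shrinking comparison list, here is a hedged sketch. sketch_closest_original() and its $max_length_diff parameter are hypothetical and are not the patch's closest_original(). It assumes strings short enough for native levenshtein() (255 bytes before PHP 8.0). The pre-check is safe because the edit distance can never be smaller than the difference in length, so a pair whose lengths differ by more than 25% cannot reach a 75% score.

```php
<?php
// Hypothetical sketch; names and the 0.25 length ratio are assumptions.
function sketch_closest_original( string $new, array $candidates, float $max_length_diff = 0.25 ): ?array {
    $best       = null;
    $new_length = strlen( $new );
    foreach ( $candidates as $key => $old ) {
        $longest = max( $new_length, strlen( $old ) );
        if ( 0 === $longest ) {
            continue;
        }
        // Cheap pre-check: skip the expensive comparison when lengths differ too much.
        if ( abs( $new_length - strlen( $old ) ) / $longest > $max_length_diff ) {
            continue;
        }
        $score = 1.0 - levenshtein( $new, $old ) / $longest;
        if ( null === $best || $score > $best['score'] ) {
            $best = array( 'key' => $key, 'score' => $score );
        }
    }
    return $best;
}

$originals_for_comparison = array(
    'a' => 'Save your changes',
    'b' => 'Your changes were saved',
    'c' => 'Log out',
);

$match = sketch_closest_original( 'Save your changes.', $originals_for_comparison );
if ( null !== $match && $match['score'] >= 0.75 ) {
    // A matched original is removed so later strings are not compared against it again.
    unset( $originals_for_comparison[ $match['key'] ] );
}
```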

comment:9 follow-ups: @yoavf 20 months ago

Just realized it will be much more effective to only compare dropped strings with new strings, and get our fuzzies from there. I'll work on that sometime next week.

comment:10 @markoheijnen 20 months ago

I think putting this in a separate class would make sense. And yes, it's more effective to only compare dropped strings. I guess the import code as a whole needs a rewrite.

comment:11 in reply to: ↑ 9 @nacin 20 months ago

Replying to yoavf:

Just realized it will be much more effective to only compare dropped strings with new strings, and get our fuzzies from there. I'll work on that sometime next week.

Not just dropped strings — probably also any other obsolete strings.

comment:12 @markoheijnen 19 months ago

  • Milestone changed from Future Release to 1.0

comment:13 in reply to: ↑ 9 @yoavf 17 months ago

Replying to yoavf:

I'll work on that sometime next week.

That took a while :)

263.2.diff changes things a bit:

  • It now collects unmatched strings (those with no exact original) in the possibly_added array.
  • It puts all unused strings in the possibly_dropped array.
  • It creates a comparison array, which combines the possibly_dropped array with the existing obsolete originals.
  • It loops over all the possibly_added strings and tries to find similar strings in the comparison array.
  • If it finds a match, it marks that original as active and its existing translations as fuzzy.
  • No match? It's a new original that gets added.
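
The steps above can be sketched as follows. This is a simplified illustration on plain string arrays, not the code in 263.2.diff: sketch_plan_import() and sketch_score() are hypothetical names, originals are assumed to be short enough for native levenshtein(), and only active originals are checked for exact matches.

```php
<?php
// Hypothetical score helper: 1.0 means identical, 0.0 means nothing in common.
function sketch_score( string $a, string $b ): float {
    $longest = max( strlen( $a ), strlen( $b ) );
    return 0 === $longest ? 1.0 : 1.0 - levenshtein( $a, $b ) / $longest;
}

function sketch_plan_import( array $active, array $obsolete, array $incoming, float $threshold = 0.75 ): array {
    // Incoming strings with no exact original, and active originals no longer present.
    $possibly_added   = array_values( array_diff( $incoming, $active ) );
    $possibly_dropped = array_values( array_diff( $active, $incoming ) );
    // Candidates for a fuzzy match: dropped originals plus already-obsolete ones.
    $comparison = array_merge( $possibly_dropped, $obsolete );

    $plan = array( 'fuzzy' => array(), 'added' => array() );
    foreach ( $possibly_added as $new ) {
        $best_key   = null;
        $best_score = 0.0;
        foreach ( $comparison as $key => $old ) {
            $score = sketch_score( $old, $new );
            if ( $score > $best_score ) {
                $best_key   = $key;
                $best_score = $score;
            }
        }
        if ( null !== $best_key && $best_score >= $threshold ) {
            // Reuse the old original: it becomes active and its translations fuzzy.
            $plan['fuzzy'][ $comparison[ $best_key ] ] = $new;
            unset( $comparison[ $best_key ] );
        } else {
            // No close match: this is a genuinely new original.
            $plan['added'][] = $new;
        }
    }
    return $plan;
}

$plan = sketch_plan_import(
    array( 'Save your changes', 'Log out' ),                          // active originals
    array( 'Welcome back' ),                                          // obsolete originals
    array( 'Save your changes.', 'Log out', 'Export translations' )   // incoming strings
);
print_r( $plan );
```

In this run, 'Save your changes.' is paired with the dropped 'Save your changes', while 'Export translations' has no close candidate and is added as new.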

@yoavf 17 months ago

comment:14 @yoavf 15 months ago

Coming back to this, I'd love some feedback. This has been running fine on WP.com for a few months now.

One TODO I already have is to add a count of fuzzy strings per language on the project page (already running on WP.com too, it just needs cleaning up).

comment:15 @markoheijnen 14 months ago

  • Keywords gsoc added

@yoavf 12 months ago

@yoavf 12 months ago

previous file had some duplication

comment:17 @yoavf 12 months ago

In 263.3.2.diff I cleaned up the code a bit and simplified the comparison. There's now a hard limit of 5000 chars on string comparison, with a custom levenshtein() function and no reliance on the slower similar_text().

Also updated the UI to display the fuzzy count on the project view, and in the import result strings.

comment:18 @yoavf 12 months ago

  • Keywords 2nd-opinion added; gsoc removed

The comparison functions should probably be outside of the 'original' class. Any suggestions on where to put them?

comment:19 follow-up: @markoheijnen 12 months ago

Create a new class/file for all the comparison code in gp-includes?

Last edited 12 months ago by markoheijnen

@yoavf 12 months ago

comment:20 in reply to: ↑ 19 @yoavf 12 months ago

Replying to markoheijnen:

Create a new class/file for all the comparison code in gp-includes?

Thanks, I think we can put them in strings.php, which I did in 263.4.diff. I also added a simple test for the similarity function and switched to using gp_* functions instead of mb_* ones.

@yoavf 12 months ago

comment:21 @yoavf 12 months ago

  • Keywords 2nd-opinion removed

263.5.diff adds a test for the fuzzy on import functionality.

comment:22 @yoavf 12 months ago

Hrm, the test I added works when running this single test file, but not when running the full test suite. Investigating.

Edit: To clarify, the problem is with the following call:

$translations = GP::$translation->find_many( "original_id = '$original_id' AND status = 'current'" );

When running the t/tests/tests_things/test_thing_original.php file directly, it returns an array of translation objects.
When running the whole test suite, it returns an array of stdClass objects.

Last edited 12 months ago by yoavf

comment:23 @yoavf 12 months ago

With #298 fixed, tests are now ok too.

comment:24 @yoavf 12 months ago

In 877:

translation sets: introduce the fuzzy_count method and non-db property so we can easily access it when needed. See #263

@yoavf 12 months ago

Add an action hook on string similarity

comment:25 @yoavf 12 months ago

In 881:

String functions: introduce a couple of functions to measure string similarity.

gp_levenshtein() lets us compare strings longer than 255 bytes (PHP's native levenshtein() limit).
gp_string_similarity() is a wrapper around it that returns a score between 0 (no similarity) and 1 (same string), and makes sure we're not running comparisons on strings longer than 5000 chars (an arbitrary limit for performance purposes).
See #263
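
For readers without the changeset at hand, a length-unrestricted Levenshtein and a 0-to-1 wrapper might look like the sketch below. This is an independent illustration, not the code from r881: the sketch_* names are hypothetical, input is assumed to be valid UTF-8, and returning 0.0 for strings over the cap is an assumption, since the commit message does not say what is returned in that case.

```php
<?php
// Split a string into UTF-8 characters so multibyte letters count as one edit.
function sketch_chars( string $s ): array {
    $chars = preg_split( '//u', $s, -1, PREG_SPLIT_NO_EMPTY );
    return false === $chars ? array() : $chars; // Invalid UTF-8 is treated as empty.
}

// Two-row dynamic programming Levenshtein with no 255-byte limit.
function sketch_levenshtein( string $a, string $b ): int {
    $a        = sketch_chars( $a );
    $b        = sketch_chars( $b );
    $previous = range( 0, count( $b ) );
    foreach ( $a as $i => $char_a ) {
        $current = array( $i + 1 );
        foreach ( $b as $j => $char_b ) {
            $cost      = ( $char_a === $char_b ) ? 0 : 1;
            $current[] = min(
                $previous[ $j + 1 ] + 1, // deletion
                $current[ $j ] + 1,      // insertion
                $previous[ $j ] + $cost  // substitution
            );
        }
        $previous = $current;
    }
    return $previous[ count( $b ) ];
}

// Score between 0.0 (no similarity) and 1.0 (same string).
function sketch_string_similarity( string $a, string $b, int $max_chars = 5000 ): float {
    if ( $a === $b ) {
        return 1.0;
    }
    $longest = max( count( sketch_chars( $a ) ), count( sketch_chars( $b ) ) );
    if ( $longest > $max_chars ) {
        return 0.0; // Assumption: over-long strings are simply treated as not similar.
    }
    return 1.0 - sketch_levenshtein( $a, $b ) / $longest;
}

var_dump( sketch_levenshtein( 'kitten', 'sitting' ) ); // int(3)
```

Keeping only two rows of the distance table bounds memory by the length of the second string, but the running time still grows with the product of both lengths, which is why a cap such as 5000 chars is useful.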

comment:26 @yoavf 12 months ago

In 882:

String functions: unit tests, follow-up to r881. See #263

comment:27 @yoavf 12 months ago

In 883:

Project view: show fuzzy count on translated sets, see #263

@yoavf 12 months ago

Core changes for this ticket, after helpers have been committed.

@yoavf 11 months ago

refresh, fix a couple of minor bugs in closest_original()

comment:28 @slackbot 4 months ago

This ticket was mentioned in Slack in #polyglots by markoheijnen.

comment:29 @slackbot 4 weeks ago

This ticket was mentioned in Slack in #glotpress by ocean90.
