WordPress.org

GlotPress

Opened 23 months ago

Closed 3 months ago

Last modified 3 months ago

#263 closed enhancement (fixed)

On import, detect when a string has minor changes and keep the original translations.

Reported by: yoavf
Owned by: ocean90
Milestone: 1.0
Priority: major
Version: 0.1
Component: General
Keywords: has-patch
Cc:

Description

If a string has a minor change before an import (say, a dot added to the end of a sentence), all existing translations will be useless, since the original will be marked as obsolete.

GlotPress should detect minor changes on original on import, and instead mark existing translations as fuzzy.

Attachments (11)

fuzzy_similar_translation.diff (4.6 KB) - added by yoavf 23 months ago.
263.diff (5.8 KB) - added by yoavf 23 months ago.
263.2.diff (7.4 KB) - added by yoavf 20 months ago.
263.3.diff (12.7 KB) - added by yoavf 16 months ago.
263.3.2.diff (11.7 KB) - added by yoavf 16 months ago.
previous file had some duplication
263.4.diff (13.2 KB) - added by yoavf 16 months ago.
263.5.diff (15.3 KB) - added by yoavf 16 months ago.
263.6.diff (15.4 KB) - added by yoavf 15 months ago.
Add an action hook on string similarity
263.7.diff (10.8 KB) - added by yoavf 15 months ago.
Core changes for this ticket, after helpers have been committed.
263.8.diff (10.5 KB) - added by yoavf 14 months ago.
refresh, fix a couple of minor bugs in closest_original()
263.9.diff (10.7 KB) - added by yoavf 3 months ago.


Change History (44)

comment:1 @yoavf 23 months ago

fuzzy_similar_translation.diff uses a Levenshtein-based comparison to go over new originals and compare them with previously existing originals. If a new string is at least 75% similar to a previous string that is now missing, the new string replaces the old one and existing translations are marked as fuzzy.

TODO:

  • Use PHP's similar_text() for strings longer than 255 chars (levenshtein()'s practical limit).
  • Make the 75% default filterable.

Comments welcome :)
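
For illustration, a minimal sketch of the comparison described above. The name sketch_similarity_percent() and the way the percentage is derived from the edit distance are assumptions, not code from fuzzy_similar_translation.diff; the similar_text() branch corresponds to the first TODO item.

```php
<?php
// Hypothetical helper, not taken from the patch.
// Returns how similar two originals are, as a percentage from 0 to 100.
function sketch_similarity_percent( $old, $new ) {
	if ( $old === $new ) {
		return 100.0;
	}

	$longest = max( strlen( $old ), strlen( $new ) );

	// Before PHP 8.0, levenshtein() rejects arguments longer than 255 bytes,
	// so longer strings go through similar_text() instead.
	if ( $longest > 255 ) {
		similar_text( $old, $new, $percent );
		return $percent;
	}

	// Assumed scoring: edit distance relative to the longer string.
	return ( 1 - levenshtein( $old, $new ) / $longest ) * 100;
}

// A trailing dot keeps the string above the 75% default.
$is_fuzzy_match = sketch_similarity_percent( 'Save changes', 'Save changes.' ) >= 75;
```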

Last edited 23 months ago by yoavf

comment:2 follow-up: @markoheijnen 23 months ago

Awesome, I was still planning to look into this. I will close #93 as a duplicate, then.

comment:3 follow-up: @nacin 23 months ago

Both levenshtein() and similar_text() are very slow. What kind of an effect might this have on speed of imports? WordPress.org can handle it, but it would be good to know what the slow-down will look like.

If a string is less than, say, 15 or 20 characters, is fuzziness worth it? I imagine that the 75% should be a sliding scale. A 75% (or even 80%) match of 20 characters would probably result in false positives, while 75% (or less) might be OK when dealing with particularly long strings.
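
A sliding scale along these lines could be sketched as follows. The function name and every cut-off value are made up for illustration; nothing here comes from a patch on this ticket.

```php
<?php
// Hypothetical sliding scale; all cut-offs below are illustrative guesses.
// Returns the minimum similarity (0 to 1) required for a fuzzy match,
// or null when the string is too short to be worth matching at all.
function sketch_required_similarity( $length ) {
	if ( $length < 20 ) {
		return null; // short strings: too many false positives
	}
	if ( $length < 100 ) {
		return 0.85; // medium strings: stricter than a flat 75%
	}
	return 0.75; // long strings: a flat 75% is probably acceptable
}
```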

comment:4 follow-up: @nacin 23 months ago

It would be good if this was entirely driven by filters. So here's string A, here's string B, please return true/false as to whether the string is fuzzy. Then a function could hook in and decide if the strings are close enough. (Inside this callback could be a filter controlling the 75% threshold.)

One benefit of a hook is that on WordPress.org I would want to log all 50%+ matches over the course of a number of imports, to see if the threshold is appropriate, or if it is flagging a lot of false positives (the problem here being where a string's meaning changes but translators don't pick up on it).
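
A rough sketch of that filter-driven shape, including a logging callback for 50%+ matches. Sketch_Hooks is a tiny stand-in for a WordPress-style add_filter()/apply_filters() API so the example runs on its own; the class, the function and both filter names are hypothetical.

```php
<?php
// Stand-in hook registry so the sketch is self-contained. All names are hypothetical.
final class Sketch_Hooks {
	private static $filters = array();

	public static function add( $tag, callable $callback ) {
		self::$filters[ $tag ][] = $callback;
	}

	public static function apply( $tag, $value, ...$args ) {
		$callbacks = isset( self::$filters[ $tag ] ) ? self::$filters[ $tag ] : array();
		foreach ( $callbacks as $callback ) {
			$value = call_user_func_array( $callback, array_merge( array( $value ), $args ) );
		}
		return $value;
	}
}

// Default decision: compare the score against a filterable threshold,
// then let hooked callbacks see both strings and override the verdict.
function sketch_is_fuzzy_match( $old, $new, $score ) {
	$threshold  = Sketch_Hooks::apply( 'sketch_similarity_threshold', 0.75 );
	$is_similar = $score >= $threshold;
	return (bool) Sketch_Hooks::apply( 'sketch_is_string_similar', $is_similar, $old, $new, $score );
}

// Example callback: record every 50%+ match without changing the verdict.
$sketch_log = array();
Sketch_Hooks::add(
	'sketch_is_string_similar',
	function ( $is_similar, $old, $new, $score ) use ( &$sketch_log ) {
		if ( $score >= 0.5 ) {
			$sketch_log[] = array( $old, $new, $score, $is_similar );
		}
		return $is_similar;
	}
);
```

Keeping the threshold and the final verdict on separate hooks means a site can tune the number without replacing the whole decision.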

comment:5 in reply to: ↑ 2 @yoavf 23 months ago

Replying to markoheijnen:

Awesome, I was still planning to look into this. I will close #93 as a duplicate, then.

Thanks, not sure how I missed #93 :)

comment:6 in reply to: ↑ 3 @yoavf 23 months ago

Replying to nacin:

Both levenshtein() and similar_text() are very slow. What kind of an effect might this have on speed of imports? WordPress.org can handle it, but it would be good to know what the slow-down will look like.

I ran a few local tests.

Importing 2k strings into a 20k-string database: about 2.5 seconds added with this patch
Importing 20k strings into a 2k-string database: about 4.5 seconds added with this patch

Of course, times get higher the fewer matches you have between the old and new databases, but I don't think time is too much of an issue here.

I plan to test this on WP.com soon - I'll report the results here. (We do a full import every hour or so, and our main .pot file holds about 20k strings).

comment:7 in reply to: ↑ 4 @yoavf 23 months ago

Replying to nacin:

If a string is less than, say, 15 or 20 characters, is fuzziness worth it? I imagine that the 75% should be a sliding scale. A 75% (or even 80%) match of 20 characters would probably result in false positives, while 75% (or less) might be OK when dealing with particularly long strings.

I agree, I'll see about making this variable based on length.

It would be good if this was entirely driven by filters. So here's string A, here's string B, please return true/false as to whether the string is fuzzy. Then a function could hook in and decide if the strings are close enough. (Inside this callback could be a filter controlling the 75% threshold.)

One benefit of a hook is that on WordPress.org I would want to log all 50%+ matches over the course of a number of imports, to see if the threshold is appropriate, or if it is flagging a lot of false positives (the problem here being where a string's meaning changes but translators don't pick up on it).

I'll work on that.

@yoavf 23 months ago

comment:8 @yoavf 23 months ago

263.diff breaks things up a bit (in a somewhat complex way):

  • closest_original() compares string lengths before doing the textual comparison (filterable)
  • string_similarity() handles the actual comparison per word (using levenshtein() / similar_text() depending on length)
  • the best-matching string is then passed, along with the similarity percentage, to gp_original_is_string_similar
  • is_string_similar() makes that decision by default

Another improvement: instead of comparing against all the originals for every string, we only compare against the originals we haven't matched yet (unset( $originals_for_comparison[$entry->key()] );)
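
A simplified sketch of the length pre-check and the shrinking candidate pool. It is not the closest_original() from 263.diff: the function name, the $max_length_diff parameter and the use of similar_text() as the similarity measure are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch; the actual closest_original() is in the attached patch.
// $candidates maps a key to the text of an existing original.
function sketch_closest_original( $new, array $candidates, $max_length_diff = 0.5 ) {
	$best_key   = null;
	$best_score = 0.0;
	$new_length = strlen( $new );

	foreach ( $candidates as $key => $old ) {
		$old_length = strlen( $old );
		$longest    = max( $new_length, $old_length );

		// Cheap pre-check: skip candidates whose length differs too much.
		if ( 0 === $longest || abs( $new_length - $old_length ) / $longest > $max_length_diff ) {
			continue;
		}

		// similar_text() stands in for the per-word comparison in the patch.
		similar_text( $old, $new, $percent );
		if ( $percent / 100 > $best_score ) {
			$best_key   = $key;
			$best_score = $percent / 100;
		}
	}

	return array( $best_key, $best_score );
}

// Matched candidates are removed so later strings have less to compare against.
$pool = array( 'a' => 'Save changes', 'b' => 'Delete this post permanently' );
list( $key, $score ) = sketch_closest_original( 'Save changes.', $pool );
if ( null !== $key && $score >= 0.75 ) {
	unset( $pool[ $key ] );
}
```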

comment:9 follow-ups: @yoavf 23 months ago

Just realized it will be much more effective to only compare dropped strings with new strings, and get our fuzzies from there. I'll work on that sometime next week.

comment:10 @markoheijnen 23 months ago

I think putting this in a separate class would make sense. And yes, it's more effective to only compare dropped strings. I guess the import code as a whole needs a rewrite.

comment:11 in reply to: ↑ 9 @nacin 23 months ago

Replying to yoavf:

Just realized it will be much more effective to only compare dropped strings with new strings, and get our fuzzies from there. I'll work on that sometime next week.

Not just dropped strings — probably also any other obsolete strings.

comment:12 @markoheijnen 22 months ago

  • Milestone changed from Future Release to 1.0

comment:13 in reply to: ↑ 9 @yoavf 20 months ago

Replying to yoavf:

I'll work on that sometime next week.

That took a while :)

263.2.diff changes things a bit:

  • It now collects unmatched strings (no exact originals) into the possibly_added array.
  • It puts all unused strings in the possibly_dropped array.
  • It creates a comparison array which combines the possibly_dropped array with existing obsolete originals.
  • It loops over all the possibly_added strings and tries to find similar strings in the comparison array.
  • If it finds a match, it marks that original as active and its existing translations as fuzzy.
  • No match? It's a new original that gets added.
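
The flow in the list above can be sketched roughly like this. It is a simplified illustration, not the code in the patch: the function name, the array shapes and the use of similar_text() as the similarity measure are assumptions.

```php
<?php
// Hypothetical sketch of the import flow described in the list above.
// $active and $obsolete map original IDs to text; $incoming is a list of texts.
function sketch_reconcile_import( array $active, array $obsolete, array $incoming, $min_score = 0.75 ) {
	// Incoming strings with no exact match among the active originals.
	$possibly_added = array_values( array_diff( $incoming, $active ) );
	// Active originals that the new file no longer contains (IDs preserved).
	$possibly_dropped = array_diff( $active, $incoming );
	// Candidate pool: dropped originals plus already-obsolete ones.
	$comparison = $possibly_dropped + $obsolete;

	$fuzzy = array(); // original ID => new text; its translations become fuzzy
	$added = array(); // brand-new originals

	foreach ( $possibly_added as $text ) {
		$best_id    = null;
		$best_score = 0.0;
		foreach ( $comparison as $id => $old ) {
			similar_text( $old, $text, $percent ); // stand-in similarity measure
			if ( $percent / 100 > $best_score ) {
				$best_id    = $id;
				$best_score = $percent / 100;
			}
		}

		if ( null !== $best_id && $best_score >= $min_score ) {
			$fuzzy[ $best_id ] = $text;
			unset( $comparison[ $best_id ] ); // each old original is reused once
		} else {
			$added[] = $text;
		}
	}

	return array(
		'fuzzy'    => $fuzzy,
		'added'    => $added,
		'obsolete' => array_keys( $comparison ),
	);
}
```

Only the strings without an exact match are compared, and only against the dropped and obsolete originals, so the expensive comparisons are limited to what actually changed between imports.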

@yoavf 20 months ago

comment:14 @yoavf 19 months ago

Coming back to this, I'd love some feedback. This has been running fine on WP.com for a few months now.

One TODO I already have is to add a count of fuzzy strings per language on the project page (already running on WP.com too, just need to clean it up).

comment:15 @markoheijnen 18 months ago

  • Keywords gsoc added

@yoavf 16 months ago

@yoavf 16 months ago

previous file had some duplication

comment:17 @yoavf 16 months ago

In 263.3.2.diff I cleaned up the code a bit and simplified the comparison. There's now a hard limit of 5000 chars on string comparison, with a custom levenshtein() function and no reliance on the slower similar_text().

Also updated the UI to display the fuzzy count on the project view, and in the import result strings.

comment:18 @yoavf 16 months ago

  • Keywords 2nd-opinion added; gsoc removed

The comparison functions should probably be outside of the 'original' class. Any suggestions on where to put them?

comment:19 follow-up: @markoheijnen 16 months ago

Create a new class/file for all the comparison code in gp-includes?

Last edited 16 months ago by markoheijnen

@yoavf 16 months ago

comment:20 in reply to: ↑ 19 @yoavf 16 months ago

Replying to markoheijnen:

Create a new class/file for all the comparison code in gp-includes?

Thanks, I think we can put them in strings.php, which I did with 263.4.diff - and also added a simple test for the similarity function. I also switched to using gp_* functions instead of mb_* ones.

@yoavf 16 months ago

comment:21 @yoavf 16 months ago

  • Keywords 2nd-opinion removed

263.5.diff adds a test for the fuzzy on import functionality.

comment:22 @yoavf 16 months ago

hrm, the test I added works when running this single test file, but not when running the full test suite. Investigating.

Edit: To clarify, the problem is with the following call:

$translations = GP::$translation->find_many( "original_id = '$original_id' AND status = 'current'" );

When running the t/tests/tests_things/test_thing_original.php file directly, it returns an array of translation objects.
When running the whole test suite, it returns an array of stdClass objects.

Last edited 16 months ago by yoavf

comment:23 @yoavf 16 months ago

With #298 fixed, tests are now ok too.

comment:24 @yoavf 15 months ago

In 877:

translation sets: introduce the fuzzy_count method and non-db property so we can easily access it when needed. See #263

@yoavf 15 months ago

Add an action hook on string similarity

comment:25 @yoavf 15 months ago

In 881:

String functions: introduce a couple of functions to measure string similarity.

gp_levenshtein() lets us compare strings longer than 255 bytes (PHP's native levenshtein() limit).
gp_string_similarity() is a wrapper around it that returns a score between 0 (no similarity) and 1 (same string), and makes sure we're not running comparisons on strings longer than 5000 chars (arbitrary limit for performance purposes).
See #263
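
To illustrate the idea behind these two helpers, here is a self-contained sketch. It is not the code committed in r881: the sketch_* names, the byte-wise comparison, the score formula and the choice to return 0 for over-long strings are all assumptions.

```php
<?php
// Hypothetical stand-in for a length-unlimited Levenshtein distance.
// Works byte by byte and keeps only two rows of the distance matrix.
function sketch_levenshtein( $a, $b ) {
	$len_a = strlen( $a );
	$len_b = strlen( $b );
	if ( 0 === $len_a ) {
		return $len_b;
	}
	if ( 0 === $len_b ) {
		return $len_a;
	}

	$previous = range( 0, $len_b );
	for ( $i = 1; $i <= $len_a; $i++ ) {
		$current = array( $i );
		for ( $j = 1; $j <= $len_b; $j++ ) {
			$cost          = ( $a[ $i - 1 ] === $b[ $j - 1 ] ) ? 0 : 1;
			$current[ $j ] = min(
				$previous[ $j ] + 1,        // deletion
				$current[ $j - 1 ] + 1,     // insertion
				$previous[ $j - 1 ] + $cost // substitution
			);
		}
		$previous = $current;
	}
	return $previous[ $len_b ];
}

// Hypothetical wrapper: score from 0 (nothing in common) to 1 (same string).
// Strings longer than $max_length are not compared and score 0 (an assumption).
function sketch_string_similarity( $a, $b, $max_length = 5000 ) {
	$longest = max( strlen( $a ), strlen( $b ) );
	if ( 0 === $longest ) {
		return 1.0;
	}
	if ( $longest > $max_length ) {
		return 0.0;
	}
	return 1.0 - sketch_levenshtein( $a, $b ) / $longest;
}
```

The two-row approach keeps memory proportional to the length of one string, but the running time still grows with the product of both lengths, which is why a cap such as the 5000-char limit matters.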

comment:26 @yoavf 15 months ago

In 882:

String functions: unit tests, follow-up to r881. See #263

comment:27 @yoavf 15 months ago

In 883:

Project view: show fuzzy count on translated sets, see #263

@yoavf 15 months ago

Core changes for this ticket, after helpers have been committed.

@yoavf 14 months ago

refresh, fix a couple of minor bugs in closest_original()

comment:28 @slackbot 8 months ago

This ticket was mentioned in Slack in #polyglots by markoheijnen. View the logs.

comment:29 @slackbot 4 months ago

This ticket was mentioned in Slack in #glotpress by ocean90. View the logs.

@yoavf 3 months ago

comment:30 @yoavf 3 months ago

263.9.diff is refreshed, fixes a copy-pasted filter name, and raises the minimum similarity score to 0.8, which we've found to work better on WP.com.

comment:31 @slackbot 3 months ago

This ticket was mentioned in Slack in #glotpress by yoavf. View the logs.

comment:32 @ocean90 3 months ago

  • Owner set to ocean90
  • Resolution set to fixed
  • Status changed from new to closed

In 1032:

On import, detect when a string has minor changes and keep the original translations. Existing translations will be marked as fuzzy.

props yoavf.
fixes #263.

comment:33 @slackbot 3 months ago

This ticket was mentioned in Slack in #glotpress by ocean90. View the logs.
