Add language attribute to canvas #17770

Aditi-1400 · 2024-03-04T16:55:22Z

Fixes issue #16843.

In certain cases, the text layer was misaligned due to a difference between the lang attribute of the viewer and the canvas. This commit addresses the problem by adding the lang attribute to the canvas. The issue was caused because PDF.js uses serif/sans-serif fonts to generate the text layer and relies on system fonts. The difference in the lang attribute led to different fonts being picked, causing the misalignment.

Before the change:

After the change:

Thanks, @nicolo-ribaudo for helping with this.
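
As a rough illustration of the idea in the description, the sketch below copies the viewer's language onto the canvas used for text measurement, so that both would resolve generic font families in the same language context. It assumes both arguments are ordinary DOM elements; `syncCanvasLang` is a hypothetical name and is not part of PDF.js.

```javascript
// Hypothetical helper (not PDF.js code): make the measuring canvas use the
// same language as the viewer element.
function syncCanvasLang(canvas, viewer) {
  // `element.lang` reflects only the element's own attribute; it is an
  // empty string when the element itself declares no language.
  const lang = viewer.lang || "";
  if (canvas.lang !== lang) {
    canvas.lang = lang;
  }
  return canvas;
}
```

In a browser this might be called as `syncCanvasLang(canvas, document.getElementById("viewer"))`, assuming the viewer element carries the `lang` attribute itself.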

Snuffleupagus

Sorry, but this change isn't correct.
First of all the textLayer can be used outside of the viewer, a use-case that this patch completely breaks. Secondly you're not really "allowed" to do inline DOM lookups like this, hence a different solution would be required.

calixteman · 2024-03-04T17:06:57Z

I'd say that the right thing to do is to attach the canvas at the right place here:
https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js#L80

Aditi-1400 · 2024-03-04T17:32:47Z

> Sorry, but this change isn't correct. First of all the textLayer can be used outside of the viewer, a use-case that this patch completely breaks. Secondly you're not really "allowed" to do inline DOM lookups like this, hence a different solution would be required.

Makes sense. I believe that, to fix this, getCtx() could be changed to set the lang attribute of the canvas to match that of the viewer?

Aditi-1400 · 2024-03-25T12:08:20Z

@calixteman @Snuffleupagus Hey, I updated the pull request so that we assign the same language as the viewer to the canvas. This fixes the text layer issue and also the problems that @calixteman mentioned above.

Snuffleupagus · 2024-04-10T07:30:56Z

I'm not even sure if this is the "correct" solution, given e.g. #17770 (comment) above? /cc @calixteman
Also, generally speaking, does the solution need to be this "complicated" or can it be simplified?

(Assuming this is a good solution, the actual implementation needs a little bit of work to improve things.)

nicolo-ribaudo · 2024-04-10T09:00:08Z

I wonder what "the right place" would be. Currently the language is set per-page, while there is only one canvas injected in the document.

Are all pages of a PDF guaranteed to have the same language? If so we could inject the canvas in the first page's text layer, and re-use it for all the pages. If not, the alternative is to inject the canvas in every page.

calixteman · 2024-04-10T16:11:46Z

I suppose we could pass the viewer to the text layer in order to attach the canvas to it, but it'd be almost the same amount of changes. Or we could add something like `const element = container.closest("[lang]:not([lang=''])") || document.body;` when creating the canvas context (but only once the lang attribute has been added).
I think my preference would be to pass the lang as it's done in this patch... but usually I rely on @Snuffleupagus' opinion for this kind of question.
@Snuffleupagus, could you imagine a better/simpler solution?
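
The `closest()` alternative floated in this comment could be wrapped roughly as follows. This is only a sketch of the suggestion, not something the thread settled on; `resolveLang` and `fallbackRoot` are hypothetical names, and the fallback is passed in rather than hard-coded to `document.body`.

```javascript
// Hypothetical helper: find the language that applies to `container`.
function resolveLang(container, fallbackRoot) {
  // Element.closest() tests the element itself first, then its ancestors,
  // and returns null when nothing matches the selector.
  const element = container.closest("[lang]:not([lang=''])") || fallbackRoot;
  return element.lang || "";
}
```

In a browser the call might look like `resolveLang(textLayerDiv, document.body)`. Note that this is exactly the kind of DOM lookup from inside the text layer that was objected to earlier in the thread.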

Snuffleupagus

Unfortunately this is not a good solution in its current form, since viewer initialization and first rendering is now blocked on Metadata fetching and parsing despite that being unrelated to actual rendering.

This will thus affect general viewer loading performance for all PDF documents, given that /Info-dictionaries and /Metadata-streams are often placed at the end of the file. Especially for e.g. linearized PDFs this could thus have a noticeable impact.

Hence my first idea, to work-around these problems, could be to also include the /Lang-data in the textContent such that it becomes directly available in the src/display/text_layer.js file. (I can try outlining a patch for this part during the weekend).

calixteman · 2024-04-11T15:10:16Z

> Unfortunately this is not a good solution in its current form, since viewer initialization and first rendering is now blocked on Metadata fetching and parsing despite that being unrelated to actual rendering.
>
> This will thus affect general viewer loading performance for all PDF documents, given that /Info-dictionaries and /Metadata-streams are often placed at the end of the file. Especially for e.g. linearized PDFs this could thus have a noticeable impact.

That's a fair point.

> Hence my first idea, to work-around these problems, could be to also include the /Lang-data in the textContent such that it becomes directly available in the src/display/text_layer.js file. (I can try outlining a patch for this part during the weekend).

Passing the lang in the textContent will still cause the metadata to be fetched and parsed, probably too soon; the impact would be less noticeable, but still an impact (and it should slightly increase memory use).
I suppose that having a lang isn't that frequent. If this assertion is correct, maybe we could just recompute the text layer (just the scale factors) when `lang !== ""`; otherwise we could create a text layer without any scale factors and compute them once the lang is set.
That said, I should really get back to the patch I have for using the PDF fonts in the text layer, because I guess that changing the lang attribute won't impact the fonts used in the text layer.
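
To make "scale factors" concrete: a text-layer span is stretched horizontally so that its width in the fallback font matches the width the text has in the PDF. The sketch below shows that calculation in simplified form; `computeScaleX` and its parameters are hypothetical names, and the real PDF.js layout code handles more cases than this.

```javascript
// Hypothetical, simplified version of a text-layer scale computation.
// `ctx` is expected to behave like a CanvasRenderingContext2D.
function computeScaleX(ctx, text, fontSize, fontFamily, targetWidth) {
  ctx.font = `${fontSize}px ${fontFamily}`;
  // The measured width depends on which concrete font the generic family
  // resolves to, which is where a `lang` mismatch can matter.
  const { width } = ctx.measureText(text);
  // Avoid dividing by zero for empty or unmeasurable strings.
  return width > 0 ? targetWidth / width : 1;
}
```

If the canvas resolves `sans-serif` to a different font than the page does, the measured width changes and the resulting factor no longer fits the rendered span, which is the misalignment this pull request is about.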

Aditi-1400 · 2024-04-11T15:28:37Z

> Passing the lang in the textContent will still cause the metadata to be fetched and parsed, probably too soon; the impact would be less noticeable, but still an impact (and it should slightly increase memory use).
> I suppose that having a lang isn't that frequent. If this assertion is correct, maybe we could just recompute the text layer (just the scale factors) when `lang !== ""`; otherwise we could create a text layer without any scale factors and compute them once the lang is set.
> That said, I should really get back to the patch I have for using the PDF fonts in the text layer, because I guess that changing the lang attribute won't impact the fonts used in the text layer.

I did do some experimentation to make the text layer font the same as the document font, but I didn't go ahead with that since the font details are cleaned up after a few seconds here unless fontExtraProperties is enabled. However, if this is a preferred approach, I could look more into this, and help with completing the patch?

calixteman · 2024-04-11T16:15:40Z

> > Passing the lang in the textContent will still cause the metadata to be fetched and parsed, probably too soon; the impact would be less noticeable, but still an impact (and it should slightly increase memory use).
> > I suppose that having a lang isn't that frequent. If this assertion is correct, maybe we could just recompute the text layer (just the scale factors) when `lang !== ""`; otherwise we could create a text layer without any scale factors and compute them once the lang is set.
> > That said, I should really get back to the patch I have for using the PDF fonts in the text layer, because I guess that changing the lang attribute won't impact the fonts used in the text layer.
>
> I did do some experimentation to make the text layer font the same as the document font, but I didn't go ahead with that since the font details are cleaned up after a few seconds here unless fontExtraProperties is enabled. However, if this is a preferred approach, I could look more into this, and help with completing the patch?

The cache thing isn't really a problem here: we could just disable it. The main problem is that we have to deal with a lot of fonts which aren't always in a good state to be used as-is in the text layer.
For example, some of them have no widths table, or have a widths table different from the one used by the PDF... so the fonts must be amended to have the correct data.
And I had some issues with some PDFs in Arabic because they use ligatures which aren't specified in the font itself.
I must rebase the patch I have and I'll share it with you.
That said, after having thought about it, in some cases (like with Type3 fonts) we still need to use the text layer as it is right now.
So it's worth having a fix for this issue without refactoring the text layer stuff.

Snuffleupagus · 2024-04-11T16:17:36Z

> Passing the lang in the textContent will still cause the metadata to be fetched and parsed, probably too soon; the impact would be less noticeable, but still an impact (and it should slightly increase memory use).

My idea would only require invoking

pdf.js/src/core/catalog.js

Lines 117 to 124 in e78ce74

    
    get lang() {
      const lang = this._catDict.get("Lang");
      return shadow(
        this,
        "lang",
        typeof lang === "string" ? stringToPDFString(lang) : null
      );
    }

and not until fetching the textContent.
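
For readers unfamiliar with the pattern in the snippet above, the following self-contained sketch shows how a getter can replace itself with a plain data property on first access, so the lookup runs at most once and only when something actually asks for it. The `shadow` function here is a simplified stand-in written for this example, `CatalogSketch` is hypothetical, and the string decoding step (`stringToPDFString`) is left out.

```javascript
// Simplified stand-in for a "shadow" helper: define `prop` as an own data
// property on `obj`, hiding the prototype getter of the same name.
function shadow(obj, prop, value) {
  Object.defineProperty(obj, prop, {
    value,
    enumerable: true,
    configurable: true,
    writable: false,
  });
  return value;
}

class CatalogSketch {
  constructor(catDict) {
    this._catDict = catDict; // a Map stands in for the PDF catalog dictionary
    this.lookups = 0; // only here to make the caching observable
  }

  get lang() {
    this.lookups++;
    const lang = this._catDict.get("Lang");
    return shadow(this, "lang", typeof lang === "string" ? lang : null);
  }
}
```

With `new CatalogSketch(new Map([["Lang", "en-US"]]))`, reading `.lang` twice performs the dictionary lookup only once.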

calixteman · 2024-04-11T16:23:28Z

Can't we just get a promise for the lang, pass it when building the text layer, and read its value only when the first chunk is rendered, or something like that?
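
One way to read this suggestion is sketched below: the text layer receives a function that returns a promise for the language, and only calls it when the first chunk arrives, so construction is never blocked on it. `LazyLangTextLayer`, `getLang`, and `renderChunk` are hypothetical names made up for this illustration.

```javascript
// Hypothetical sketch: defer the language lookup until the first chunk.
class LazyLangTextLayer {
  constructor(getLang) {
    this._getLang = getLang; // function returning a Promise<string | null>
    this._langPromise = null;
  }

  async renderChunk(items) {
    // Start the lookup on the first chunk only; later chunks reuse it.
    if (this._langPromise === null) {
      this._langPromise = this._getLang();
    }
    const lang = (await this._langPromise) || "";
    // Stand-in for real rendering: tag every item with the language.
    return items.map(str => ({ str, lang }));
  }
}
```

Constructing the layer does not trigger the lookup, and rendering several chunks triggers it exactly once.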

timvandermeij · 2024-05-14T16:32:08Z

Now that #17941 is in place, this patch can be updated to make use of the new lang attribute, which should hopefully simplify this patch.

Aditi-1400 · 2024-05-17T09:11:34Z

> Now that #17941 is in place, this patch can be updated to make use of the new lang attribute, which should hopefully simplify this patch.

Updated the pull request using the new lang attribute.
Also, makeref is failing, so I didn't run the snapshot tests locally, but this shouldn't affect rendering.
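
Since the language now arrives with the text content, one possible shape for the measuring canvas is a small cache keyed by language, which would also cover the earlier question about documents whose pages differ in language. The sketch below is an illustration under that assumption and is not the merged code; `MeasuringContextCache` and `createCanvas` are hypothetical names, and the canvas factory is injected so the example does not depend on a browser.

```javascript
// Hypothetical sketch: one measuring context per language.
class MeasuringContextCache {
  constructor(createCanvas) {
    // In a browser, `createCanvas` could create a <canvas>, append it to
    // the document, and return it.
    this._createCanvas = createCanvas;
    this._contexts = new Map();
  }

  get(lang) {
    const key = lang || ""; // treat null/undefined as "no language"
    let ctx = this._contexts.get(key);
    if (!ctx) {
      const canvas = this._createCanvas();
      canvas.lang = key;
      ctx = canvas.getContext("2d", { alpha: false });
      this._contexts.set(key, ctx);
    }
    return ctx;
  }
}
```

Each distinct language gets its own canvas the first time it is requested, and repeated requests for the same language return the same context.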

Snuffleupagus

Please also add a text reference test, using the first page of the affected PDF document.

And please improve the commit message to explain what's being changed and why, i.e. you'll want more than a single line, since it's currently difficult to understand what the patch does without looking at the code.

Snuffleupagus

Unfortunately this will now require another rebase, sorry about that!

Aditi-1400 · 2024-05-21T12:08:59Z

> Unfortunately this will now require another rebase, sorry about that!

On it!

Fixes issue mozilla#16843. In certain cases, the text layer was misaligned due to a difference between the `lang` attribute of the viewer and the canvas. This commit addresses the problem by adding the `lang` attribute to the canvas. The issue was caused because PDF.js uses serif/sans-serif fonts to generate the text layer and relies on system fonts. The difference in the `lang` attribute led to different fonts being picked, causing the misalignment.

Snuffleupagus · 2024-05-21T16:29:55Z

/botio test

moz-tools-bot · 2024-05-21T16:29:57Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/030f1b06ca59e29/output.txt

moz-tools-bot · 2024-05-21T16:29:57Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/f990e5b7a010e09/output.txt

moz-tools-bot · 2024-05-21T16:36:37Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/030f1b06ca59e29/output.txt

Total script time: 6.64 mins

Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

Image differences available at: http://54.193.163.58:8877/030f1b06ca59e29/reftest-analyzer.html#web=eq.log

moz-tools-bot · 2024-05-21T16:57:43Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/f990e5b7a010e09/output.txt

Total script time: 27.76 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 20
  different first/second rendering: 2

Image differences available at: http://54.241.84.105:8877/f990e5b7a010e09/reftest-analyzer.html#web=eq.log

timvandermeij · 2024-05-21T17:33:07Z

/botio-windows test

moz-tools-bot · 2024-05-21T17:33:10Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.193.163.58:8877/dd43222cb7c27d0/output.txt

moz-tools-bot · 2024-05-21T18:18:50Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/dd43222cb7c27d0/output.txt

Total script time: 40.77 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 5

Image differences available at: http://54.193.163.58:8877/dd43222cb7c27d0/reftest-analyzer.html#web=eq.log

Snuffleupagus

r=me, thank you!

Snuffleupagus · 2024-05-21T19:36:10Z

/botio makeref

moz-tools-bot · 2024-05-21T19:36:12Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/30f062f551ad04a/output.txt

moz-tools-bot · 2024-05-21T19:36:12Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/13b047407e28182/output.txt

moz-tools-bot · 2024-05-21T19:56:19Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/30f062f551ad04a/output.txt

Total script time: 20.10 mins

Lint: Passed
Make references: Passed
Check references: Passed

moz-tools-bot · 2024-05-21T20:01:22Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/13b047407e28182/output.txt

Total script time: 25.14 mins

Lint: Passed
Make references: Passed
Check references: Passed

Snuffleupagus requested changes Mar 4, 2024

timvandermeij added the viewer label Mar 20, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch from a305589 to af3b660 Compare March 25, 2024 12:00

Aditi-1400 changed the title ~~Append canvas to div#viewer instead of document~~ Add language attribute to canvas Mar 25, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch 2 times, most recently from 23303ee to f2fbf37 Compare March 25, 2024 15:01

marco-c requested a review from Snuffleupagus April 9, 2024 13:38

Snuffleupagus requested changes Apr 11, 2024

Snuffleupagus mentioned this pull request May 3, 2024

[api-minor] Include the document /Lang attribute in the textContent-data #17941

Merged

Aditi-1400 closed this May 16, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch from f2fbf37 to 0603d1a Compare May 16, 2024 16:28

Aditi-1400 reopened this May 17, 2024

nicolo-ribaudo reviewed May 17, 2024

Snuffleupagus added the text-selection label May 17, 2024

Snuffleupagus requested changes May 17, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch from 4e53e30 to ac8a22f Compare May 21, 2024 10:43

Aditi-1400 commented May 21, 2024

Snuffleupagus reviewed May 21, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch 3 times, most recently from a56a26a to b17cce5 Compare May 21, 2024 12:06

Aditi-1400 force-pushed the fix-issue-16843 branch from b17cce5 to 86f22d0 Compare May 21, 2024 12:26

nicolo-ribaudo reviewed May 21, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch from 86f22d0 to bcfe427 Compare May 21, 2024 12:29

Snuffleupagus reviewed May 21, 2024

Snuffleupagus reviewed May 21, 2024

Snuffleupagus reviewed May 21, 2024

Aditi-1400 force-pushed the fix-issue-16843 branch from bcfe427 to 9edca0a Compare May 21, 2024 14:12

Snuffleupagus linked an issue May 21, 2024 that may be closed by this pull request

Misaligned text layer #16843

Closed

Snuffleupagus approved these changes May 21, 2024

Snuffleupagus merged commit 2a52fda into mozilla:master May 21, 2024
9 checks passed
