Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add language attribute to canvas #17770

Merged
merged 1 commit into from May 21, 2024

Conversation

Aditi-1400
Copy link
Contributor

@Aditi-1400 Aditi-1400 commented Mar 4, 2024

Fixes issue #16843.

In certain cases, the text layer was misaligned due to a difference
between the lang attribute of the viewer and the canvas. This commit addresses the problem by adding the lang attribute to the canvas. The issue was caused because PDF.js uses serif/sans-serif fonts to generate the text layer and relies on system fonts. The difference in the lang attribute led to different fonts being picked, causing the misalignment.

Before the change:
image

After the change:
image

Thanks, @nicolo-ribaudo for helping with this.

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but this change isn't correct.
First of all the textLayer can be used outside of the viewer, a use-case that this patch completely breaks. Secondly you're not really "allowed" to do inline DOM lookups like this, hence a different solution would be required.

@calixteman
Copy link
Contributor

I'd say that the right thing to do is to attach the canvas at the right place here:
https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js#L80

@Aditi-1400
Copy link
Contributor Author

Sorry, but this change isn't correct. First of all the textLayer can be used outside of the viewer, a use-case that this patch completely breaks. Secondly you're not really "allowed" to do inline DOM lookups like this, hence a different solution would be required.

Makes sense, um, I believe to fix this, getCtx() could be changed to set the lang attribute of the canvas to be the same as that of the viewer?

@Aditi-1400 Aditi-1400 changed the title Append canvas to div#viewer instead of document Add language attribute to canvas Mar 25, 2024
@Aditi-1400
Copy link
Contributor Author

@calixteman @Snuffleupagus Hey, I updated the pull request, so we assign the same language as the viewer to the canvas, this fixes the text layer issue and also the problems that @calixteman mentioned above.

@Aditi-1400 Aditi-1400 force-pushed the fix-issue-16843 branch 2 times, most recently from 23303ee to f2fbf37 Compare March 25, 2024 15:01
@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Apr 10, 2024

I'm not even sure if this is the "correct" solution, given e.g. #17770 (comment) above? /cc @calixteman
Also, generally speaking, does the solution need to be this "complicated" or can it be simplified?

(Assuming this is a good solution, the actual implementation needs a little bit of work to improve things.)

@nicolo-ribaudo
Copy link
Contributor

I wonder what "the right place" would be. Currently the language is set per-page, while there is only one canvas injected in the document.

Are all pages of a PDF guaranteed to have the same language? If so we could inject the canvas in the first page's text layer, and re-use it for all the pages. If not, the alternative is to inject the canvas in every page.

@calixteman
Copy link
Contributor

I suppose we could pass the viewer to the text layer in order to attach the canvas to it but it'd almost the same amount of changes. Or we add something like const element = container.closest("[lang]:not([lang=''])") || document.body; when creating the canvas context (but only once the lang attribute has been added).
I think my preference would be to pass the lang as it's done in this patch... but usually I rely on @Snuffleupagus opinion for this kind of questions.
@Snuffleupagus, could you imagine a better/simpler solution ?

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is not a good solution in its current form, since viewer initialization and first rendering is now blocked on Metadata fetching and parsing despite that being unrelated to actual rendering.

This will thus affect general viewer loading performance for all PDF documents, given that /Info-dictionaries and /Metadata-streams are often placed at the end of the file. Especially for e.g. linearized PDFs this could thus have a noticeable impact.

Hence my first idea, to work-around these problems, could be to also include the /Lang-data in the textContent such that it becomes directly available in the src/display/text_layer.js file. (I can try outlining a patch for this part during the weekend).

@calixteman
Copy link
Contributor

Unfortunately this is not a good solution in its current form, since viewer initialization and first rendering is now blocked on Metadata fetching and parsing despite that being unrelated to actual rendering.

This will thus affect general viewer loading performance for all PDF documents, given that /Info-dictionaries and /Metadata-streams are often placed at the end of the file. Especially for e.g. linearized PDFs this could thus have a noticeable impact.

That's a fair point.

Hence my first idea, to work-around these problems, could be to also include the /Lang-data in the textContent such that it becomes directly available in the src/display/text_layer.js file. (I can try outlining a patch for this part during the weekend).

Passing the lang in the textContent will still induce to fetch and parse the metadata probably too soon and then induce a less noticeable impact but still an impact (and it should slightly increase the memory use).
I suppose that having a lang isn't that frequent, if this assertion is correct, maybe we could just recompute the text layer (just the scale factors) when lang !== "", else we could create a text layer without any scale factors and when the lang is set then compute them.
That said I should really work again on that patch I've to use the pdf fonts in the text layer because I guess that changing the lang attribute won't impact the fonts used in the text layer.

@Aditi-1400
Copy link
Contributor Author

Aditi-1400 commented Apr 11, 2024

Passing the lang in the textContent will still induce to fetch and parse the metadata probably too soon and then induce a less noticeable impact but still an impact (and it should slightly increase the memory use). I suppose that having a lang isn't that frequent, if this assertion is correct, maybe we could just recompute the text layer (just the scale factors) when lang !== "", else we could create a text layer without any scale factors and when the lang is set then compute them. That said I should really work again on that patch I've to use the pdf fonts in the text layer because I guess that changing the lang attribute won't impact the fonts used in the text layer.

I did do some experimentation to make the text layer font the same as the document font, but I didn't go ahead with that since the font details are cleaned up after a few seconds here unless fontExtraProperties is enabled. However, if this is a preferred approach, I could look more into this, and help with completing the patch?

@calixteman
Copy link
Contributor

Passing the lang in the textContent will still induce to fetch and parse the metadata probably too soon and then induce a less noticeable impact but still an impact (and it should slightly increase the memory use). I suppose that having a lang isn't that frequent, if this assertion is correct, maybe we could just recompute the text layer (just the scale factors) when lang !== "", else we could create a text layer without any scale factors and when the lang is set then compute them. That said I should really work again on that patch I've to use the pdf fonts in the text layer because I guess that changing the lang attribute won't impact the fonts used in the text layer.

I did do some experimentation to make the text layer font the same as the document font, but I didn't go ahead with that since the font details are cleaned up after a few seconds here unless fontExtraProperties is enabled. However, if this is a preferred approach, I could look more into this, and help with completing the patch?

The cache thing isn't really a problem here: we could just disable it, the main problem is that we've to deal with a lot of fonts which aren't always in a good state to be used as is in the text layer.
For example some of them have no width table or have a widths table different of the one used by the pdf... so the fonts must be amended to have the correct data.
And I had some issues with some pdf in arabic because they're using some ligatures which aren't specified in font itself.
I must rebase the patch I've and I'll share it with you.
That said, after having thought about that, in some cases (like with type3 fonts) we still need to use the text layer as it is right now.
So it's worth having a fix for this issue without refactoring the text layer stuff.

@Snuffleupagus
Copy link
Collaborator

Passing the lang in the textContent will still induce to fetch and parse the metadata probably too soon and then induce a less noticeable impact but still an impact (and it should slightly increase the memory use).

My idea would only require invoking

pdf.js/src/core/catalog.js

Lines 117 to 124 in e78ce74

get lang() {
const lang = this._catDict.get("Lang");
return shadow(
this,
"lang",
typeof lang === "string" ? stringToPDFString(lang) : null
);
}
and not until fetching the textContent.

@calixteman
Copy link
Contributor

Can't we just get a promise with the lang and pass it when building the text layer and get its value only when the first chunk is rendered, something like that ?

@timvandermeij
Copy link
Contributor

Now that #17941 is in place, this patch can be updated to make use of the new lang attribute, which should hopefully simplify this patch.

@Aditi-1400
Copy link
Contributor Author

Aditi-1400 commented May 17, 2024

Now that #17941 is in place, this patch can be updated to make use of the new lang attribute, which should hopefully simplify this patch

Updated the pull request using the new lang attribute.
Also, makeref is failing so I didn't run snapshot tests locally, but this shouldn't affect rendering

src/display/text_layer.js Outdated Show resolved Hide resolved
src/display/text_layer.js Outdated Show resolved Hide resolved
Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a text reference test, using the first page of the affected PDF document.

And please improve the commit message to explain what's being changed and why, i.e. you'll want more than a single line, since it's currently difficult to understand what the patch does without looking at the code.

src/display/text_layer.js Outdated Show resolved Hide resolved
Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this will now require another rebase, sorry about that!

test/integration/text_layer_spec.mjs Outdated Show resolved Hide resolved
web/Misaligned Text Layer.pdf Outdated Show resolved Hide resolved
src/display/text_layer.js Outdated Show resolved Hide resolved
@Aditi-1400 Aditi-1400 force-pushed the fix-issue-16843 branch 3 times, most recently from a56a26a to b17cce5 Compare May 21, 2024 12:06
@Aditi-1400
Copy link
Contributor Author

Unfortunately this will now require another rebase, sorry about that!

On it!

Fixes issue mozilla#16843.
In certain cases, the text layer was misaligned
due to a difference between the `lang` attribute
of the viewer and the canvas. This commit addresses
the problem by adding the `lang` attribute to the canvas.

The issue was caused because PDF.js uses serif/sans-serif
fonts to generate the text layer and relies on system fonts.
The difference in the `lang` attribute led to different fonts
being picked, causing the misalignment.
@Snuffleupagus
Copy link
Collaborator

/botio test

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/030f1b06ca59e29/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/f990e5b7a010e09/output.txt

@Snuffleupagus Snuffleupagus linked an issue May 21, 2024 that may be closed by this pull request
@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/030f1b06ca59e29/output.txt

Total script time: 6.64 mins

  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED

Image differences available at: http://54.193.163.58:8877/030f1b06ca59e29/reftest-analyzer.html#web=eq.log

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/f990e5b7a010e09/output.txt

Total script time: 27.76 mins

  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 20
  different first/second rendering: 2

Image differences available at: http://54.241.84.105:8877/f990e5b7a010e09/reftest-analyzer.html#web=eq.log

@timvandermeij
Copy link
Contributor

/botio-windows test

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.193.163.58:8877/dd43222cb7c27d0/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/dd43222cb7c27d0/output.txt

Total script time: 40.77 mins

  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 5

Image differences available at: http://54.193.163.58:8877/dd43222cb7c27d0/reftest-analyzer.html#web=eq.log

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me, thank you!

@Snuffleupagus Snuffleupagus merged commit 2a52fda into mozilla:master May 21, 2024
9 checks passed
@Snuffleupagus
Copy link
Collaborator

/botio makeref

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/30f062f551ad04a/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Received

Command cmd_makeref from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/13b047407e28182/output.txt

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/30f062f551ad04a/output.txt

Total script time: 20.10 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@moz-tools-bot
Copy link
Collaborator

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/13b047407e28182/output.txt

Total script time: 25.14 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misaligned text layer
6 participants