
Cannot retrieve content of pages that are >100MB. #4543

Closed
luyizhao opened this issue Jun 7, 2019 · 1 comment · Fixed by #4571

Comments

luyizhao commented Jun 7, 2019

Puppeteer >=1.11.0 cannot retrieve the content of HTML pages that are >100MB.

In Puppeteer 1.11.0, the ws dependency was bumped to ^6.1.0 (see: d3f50ea). However, ws 6.0.0 introduced a breaking change: a maxPayload option that caps WebSocket message sizes at 100MB by default (see: websockets/ws#1402), and Puppeteer relies on the ws defaults (see: https://github.com/GoogleChrome/puppeteer/blob/9c4b6d06e214946e38999b9325c7d10152a1cf69/lib/WebSocketTransport.js#L28).

Suggested Fix

Increasing the maxPayload option in ws allows us to circumvent this issue. I would suggest allowing users to customize ws options in Puppeteer to set a custom maxPayload size. Thanks!

Steps to reproduce

Environment:

  • Puppeteer version: >=1.11.0

Steps to reproduce

  1. Create a test.html file >100MB in size. Quick example in Python:
contents = 'a' * 100 * 1024 * 1024  # 100 MiB of body payload
styling = '<style type="text/css">body{display:none}</style>'  # hide body to speed up page load
with open('test.html', 'w') as file:
    file.write(
        f'<!DOCTYPE html><html><head>{styling}</head><body>{contents}</body></html>'
    )

(Note: display is set to none in the test file purely to speed up page loading. It can be omitted, but Puppeteer will take longer to load the page when reproducing this issue.)

  2. Open the large test.html file and try to retrieve the page content (fill in the path to your test.html file in the script below):
const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
    const page = await browser.newPage();
    await page.goto("file://{path to test.html}", { timeout: 100000 });
    const content = await page.content();
    await page.close();
    await browser.close();
})();

What is the expected result?

Page content is retrievable without error.

What happens instead?

We see the following traceback:

(node:86223) UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Target closed.
    at Promise (~/node_modules/puppeteer/lib/Connection.js:183:56)
    at new Promise (<anonymous>)
    at CDPSession.send (~/node_modules/puppeteer/lib/Connection.js:182:12)
    at ExecutionContext.evaluateHandle (~/node_modules/puppeteer/lib/ExecutionContext.js:106:44)
    at ExecutionContext.<anonymous> (~/node_modules/puppeteer/lib/helper.js:111:23)
    at ExecutionContext.evaluate (~/node_modules/puppeteer/lib/ExecutionContext.js:48:31)
    at ExecutionContext.<anonymous> (~/node_modules/puppeteer/lib/helper.js:111:23)
    at DOMWorld.evaluate (~/node_modules/puppeteer/lib/DOMWorld.js:112:20)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
  -- ASYNC --
    at Frame.<anonymous> (~/node_modules/puppeteer/lib/helper.js:110:27)
    at Page.content (~/node_modules/puppeteer/lib/Page.js:612:49)
    at Page.<anonymous> (~/node_modules/puppeteer/lib/helper.js:111:23)
    at ~/test.js:7:26
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
@luyizhao luyizhao changed the title Cannot retrieve content of HTML pages that are >100MB. Cannot retrieve content of pages that are >100MB. Jun 7, 2019
aslushnikov added a commit that referenced this issue Jun 14, 2019
…4571)

This is the max message size that DevTools can emit over the DevTools
protocol: https://cs.chromium.org/chromium/src/content/browser/devtools/devtools_http_handler.cc?type=cs&q=kSendBufferSizeForDevTools&sq=package:chromium&g=0&l=83

The test is failing on Firefox, since Firefox crashes when allocating a 100MB string.

Fix #4543
luyizhao (Author)
Thanks for the quick fix @aslushnikov !

Just as an FYI related to your patch: sending content >256MB works fine in v1.18.0 or v1.16.0 when using a pipe to connect to the browser, as an alternative that bypasses the DevTools cap you mentioned in 62733a2. In v1.17.0, however, Puppeteer appears to hang regardless of the browser connection method; from looking at the changelogs between releases, I suspect this is a Chromium issue.

In any case, I don't know if it's worth documenting, but I thought I'd mention these findings just in case.
