
How to Extract Data from a Chart in a PDF

Render the PDF page as a PNG, upload to a digitizer, calibrate the axes, and export. A walkthrough with a real journal-style example and fixes for common problems.

To extract data from a chart in a PDF, render the page as a high-resolution PNG, upload it to a digitizer, calibrate two known points per axis, and export as CSV or XLSX. The PDF format itself never gets parsed — you work from a pixel image of the page.

This works because charts inside PDFs are either rasterized images (which are already pixels) or vector drawings rendered at display time (which you can re-render to PNG at any resolution without losing sharpness). Either way, the digitizer sees pixels.

Three ways to get the chart out of a PDF

Pick the method that gives you the sharpest image. Sharpness directly drives extraction accuracy.

Page export as PNG (best). Open the PDF in Preview, Acrobat, or any reader with an export-to-image option. Export the page containing the chart at 300 DPI. This re-renders vector content at high resolution and gives you the cleanest possible input. Free tools: pdftoppm -r 300 -png paper.pdf out on the command line (without -png, pdftoppm writes PPM files instead), or “Export As” → PNG in Preview.

Snipping tool (fast). Use macOS Screenshot (Cmd+Shift+4), Windows Snipping Tool, or your browser’s screenshot extension. Zoom the PDF to 200%+ before snipping so the captured pixels are dense. Crop tight to the plot area.

Full-page screenshot (last resort). Take a screenshot of the whole page, then crop. Pixel density is lower than with the snipping approach because the entire page has to fit on screen, so the chart occupies far fewer screen pixels than it does when you zoom in first.

For charts with very thin lines or dense data, the page export is meaningfully better. For a quick scan of a single value, snipping is fine.

The walkthrough

Take a hypothetical journal figure: Figure 3 from a fictional 2024 paper titled “Atmospheric methane concentrations from monitoring stations, 1990–2023.” Y axis is methane in ppb (1700–1950). X axis is year (1990–2023). The chart shows two lines: Mauna Loa and Cape Grim.

Step 1: download the PDF. Find the page with Figure 3 — say page 7.

Step 2: render that page as PNG. On the command line:

pdftoppm -r 300 -f 7 -l 7 paper.pdf methane -png

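This produces methane-7.png at 300 DPI. (pdftoppm zero-pads the page number to the width of the document’s page count, so a PDF with ten or more pages yields methane-07.png instead.) The file will typically be a few MB. That’s fine: the digitizer doesn’t care about file size, only pixel density.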
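If you render pages often, the same command can be scripted. The following is a minimal sketch, assuming Poppler's pdftoppm is installed and on your PATH; the function names pdftoppm_command and render_page are hypothetical helpers, not part of any library.

```python
import shutil
import subprocess


def pdftoppm_command(pdf_path, page, prefix, dpi=300):
    """Build the pdftoppm argument list for rendering one page as PNG."""
    return [
        "pdftoppm",
        "-r", str(dpi),    # resolution in DPI
        "-f", str(page),   # first page to render
        "-l", str(page),   # last page to render (same page)
        "-png",            # write PNG instead of the default PPM
        pdf_path,
        prefix,            # output file name prefix
    ]


def render_page(pdf_path, page, prefix, dpi=300):
    """Run pdftoppm if it is available; fail with a clear message if not."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install Poppler first")
    subprocess.run(pdftoppm_command(pdf_path, page, prefix, dpi), check=True)


# Same arguments as the walkthrough: page 7 of paper.pdf at 300 DPI.
cmd = pdftoppm_command("paper.pdf", page=7, prefix="methane")
print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect before anything touches the file system.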

Step 3: crop. Open methane-7.png in any image viewer, crop tight to Figure 3’s plot area (keep the axis tick labels visible — you need them for calibration). Save as methane-fig3.png.

Step 4: upload to the digitizer. In DataFromChart, drop the PNG onto the upload zone or click to browse. The chart appears in the canvas at full resolution; pan and zoom are available out of the box.

Step 5: place points. The chart has two series, so work one at a time. For the Mauna Loa line: click at every tick of the X axis where the line crosses (annual values). Group those points as “mauna_loa.” Then do Cape Grim and group as “cape_grim.” If the lines are dense and smooth, use the color picker for auto-extraction instead — Mauna Loa might be blue, Cape Grim red.

Step 6: calibrate. Drag the X start line to the 1990 tick and type 1990. Drag the X end line to the 2023 tick and type 2023. For Y, drag the Y start line to the 1700 ppb tick and type 1700; Y end to the 1950 ppb tick, 1950.
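Calibration amounts to a linear map from pixel position to axis value, fixed by two known points per axis. The sketch below illustrates the arithmetic only; it assumes linear (not logarithmic) axes, the pixel coordinates are made up for the example, and it is not a description of how DataFromChart implements calibration internally.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to an axis value.

    Assumes a linear axis calibrated at two known points.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration points must be at different pixels")

    def to_value(pixel):
        # Linear interpolation (or extrapolation) between the two points.
        return value_a + (pixel - pixel_a) * (value_b - value_a) / (pixel_b - pixel_a)

    return to_value


# Hypothetical pixel positions of the calibration ticks in the cropped image.
x_map = make_axis_map(200, 1990, 1850, 2023)   # columns grow left to right
y_map = make_axis_map(1400, 1700, 150, 1950)   # rows grow downward, so values fall

year = x_map(1025)  # a click halfway between the two X ticks
ppb = y_map(775)    # a click halfway between the two Y ticks
print(year, ppb)    # 2006.5 1825.0
```

Because image rows are numbered from the top, the Y calibration naturally has a negative slope; passing the two ticks in either order gives the same map.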

Step 7: export. Choose XLSX. The resulting file contains a Data sheet with all extracted (year, ppb) values per series, plus the chart image embedded for verification, plus the axis labels with units (“Year” and “ppb”).

Elapsed time: 3–4 minutes for two series with auto-extraction, 8–10 minutes for manual clicking of every annual value across both series.

Have a PDF chart open right now? Render the page as PNG, then drop it into the extractor. You’ll be exporting in five minutes.

Common problems and fixes

These cover roughly 80% of what goes wrong with PDF charts.

The chart is a low-res raster, not vector

Some publishers rasterize all figures at submission. You’ll see this when you zoom into the PDF page and the chart pixelates while the text stays crisp. Fix: there’s no fix from your side. Render at 300 DPI anyway, accept 2–3% noise on extracted values, and report the uncertainty.

The font in axis labels is hard to read

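Anti-aliased text at small sizes can blur into the gridlines and confuse you about where the tick is. Fix: zoom the PDF to 400% in your reader and read the tick labels there, so the value is unambiguous when you type it in during calibration. Zoom level does not change an exported page, so if the labels are still blurry in the PNG, re-render at a higher DPI.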

The chart spans two pages

Rare but real for full-width figures in two-column journals. Fix: render both pages, stitch in an image editor, then crop. Or — usually easier — find the high-resolution version of the figure in the journal’s supplementary materials.

Multiple overlapping series whose colors blend

PDFs sometimes use partial transparency to overlay series. The color picker will struggle because the overlap region is a different color from either series alone. Fix: extract each series with a tight color tolerance, then manually fix the overlap points by clicking them individually.
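To see why the overlap region defeats a color picker, consider the blending arithmetic. This is an illustrative sketch: the RGB values are invented, the 50% opacity is an assumption, and Euclidean RGB distance is only one plausible way a tool might measure tolerance.

```python
import math


def blend(top, bottom, alpha):
    """Composite color `top` over `bottom` with opacity alpha (0 to 1)."""
    return tuple(round(alpha * t + (1 - alpha) * b) for t, b in zip(top, bottom))


def color_distance(c1, c2):
    """Euclidean distance between two RGB colors."""
    return math.dist(c1, c2)


BLUE = (30, 120, 180)   # hypothetical Mauna Loa line color
RED = (210, 40, 40)     # hypothetical Cape Grim line color
TOLERANCE = 40          # hypothetical picker tolerance

overlap = blend(RED, BLUE, 0.5)   # red drawn at 50% opacity over blue
print(overlap)                    # (120, 80, 110)

for name, color in (("blue", BLUE), ("red", RED)):
    d = color_distance(overlap, color)
    print(name, round(d, 1), "match" if d <= TOLERANCE else "no match")
```

The blended pixel sits far from both series colors, so a tight tolerance rejects it for either series, which is why those points need manual clicks.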

The chart has a broken Y axis (split scale)

The Y axis is drawn in two segments separated by a break mark, and each segment covers a different range. The pixel-to-value map is not continuous across the break, so a single calibration won’t work. Fix: extract the top half and the bottom half separately. Calibrate the top half with its two Y endpoints, export, then re-calibrate for the bottom and export again. Merge the CSVs.
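Merging the two exports can be done in a spreadsheet or with a few lines of code. The sketch below assumes each CSV has a header row with columns named x and y; those column names and the sample values are hypothetical, so adjust them to match your actual export.

```python
import csv
import io


def read_points(csv_text):
    """Parse CSV text with hypothetical 'x' and 'y' columns into float pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["x"]), float(row["y"])) for row in reader]


def merge_segments(*csv_texts):
    """Combine points from separately calibrated exports, sorted by x."""
    points = []
    for text in csv_texts:
        points.extend(read_points(text))
    return sorted(points)


# Made-up exports from the bottom and top halves of a broken-axis chart.
bottom_half = "x,y\n1990,12.0\n1992,15.5\n"
top_half = "x,y\n1991,88.0\n1993,91.5\n"

merged = merge_segments(bottom_half, top_half)
for x, y in merged:
    print(x, y)
```

Each half keeps the values from its own calibration; the merge only interleaves the rows, it does not rescale anything.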

The PDF is a scan of a printed paper

Older papers, especially pre-2000, are scanned PDFs. Apply OCR if you need the text, but for the chart specifically just treat the scan as an image — same workflow as the rasterized-figure case. Expect higher noise.

XLSX vs CSV for PDF-extracted data

XLSX is the better default when your input is a PDF chart, because the embedded chart image becomes a built-in audit trail. Six months from now, when someone asks “where did that 1820 ppb number come from?”, you open the XLSX, see the figure, and verify the year visually. CSV gives you the numbers but throws away the provenance.

For details on what’s actually inside the XLSX file DataFromChart produces, see chart screenshot to Excel.

When this approach doesn’t work

Three cases where rendering-the-PDF-page-as-PNG is the wrong move.

The data is already in a table on a nearby page. Don’t digitize from a chart if you can extract the table. Use a PDF-to-table tool (Tabula, Camelot, or even copy-paste).

The supplementary materials contain the raw data. Always check. Authors are increasingly required to publish raw data, and a CSV in the supplements beats any digitization, however careful.

You’re trying to read patient-level data from a Kaplan-Meier curve. The chart only shows aggregate survival, not individual times-to-event. Reconstructing individual patient data requires the Guyot et al. method — covered in our meta-analysis data extraction guide.


Render your PDF page as PNG, drop it into the extractor, and you’ll have CSV or XLSX in under five minutes. The four-step workflow is identical to any other chart source — image, points, axes, data.

FAQ

Why can’t I just upload the PDF directly?

Chart digitizers work on pixels. PDFs are documents — they contain pages with mixed text, vector graphics, raster images, fonts, and metadata. Extracting “the chart” from that involves either (a) finding a raster image embedded in the PDF, which often isn’t the actual figure you see; or (b) re-rendering the page to pixels. (b) is what you should do anyway, so do it yourself in one explicit step.

What’s the best PDF page resolution to render at?

300 DPI is the sweet spot. 600 DPI is fine but produces much larger files, usually with little accuracy benefit. Below 200 DPI, small details (thin lines, fine ticks) start to lose definition.
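The file-size trade-off follows directly from the pixel count. As a rough illustration, assuming a US Letter page (8.5 × 11 inches; your paper may be A4 or another size), the rendered dimensions work out as follows:

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is an assumption; substitute your page size in inches.
for dpi in (150, 300, 600):
    w, h = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {w * h / 1e6:.1f} megapixels")
# 150 DPI: 1275 x 1650 px, 2.1 megapixels
# 300 DPI: 2550 x 3300 px, 8.4 megapixels
# 600 DPI: 5100 x 6600 px, 33.7 megapixels
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI files grow so quickly.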

Can I extract data from a scanned (photographed) PDF?

Yes, but expect higher noise. Skew, lighting variations, and JPEG-style compression artifacts all hurt accuracy. Calibrate using the two labeled ticks that are farthest apart on each axis, so that a small misplacement of either endpoint has the least effect on the scale.
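The reason a wide calibration interval helps can be shown with a first-order estimate: if one calibration endpoint is misplaced by a couple of pixels, the relative error in the axis scale is roughly that misplacement divided by the pixel distance between the two endpoints. The numbers below are hypothetical.

```python
def scale_error_percent(click_error_px, span_px):
    """Approximate relative scale error (in percent) from a misplaced endpoint.

    First-order estimate: error in pixels divided by calibration span in pixels.
    """
    return 100 * click_error_px / span_px


# Hypothetical: a 2-pixel misplacement on a noisy scan.
print(scale_error_percent(2, 250))    # adjacent ticks, 250 px apart  -> 0.8
print(scale_error_percent(2, 1250))   # axis endpoints, 1250 px apart -> 0.16
```

The same 2-pixel slip costs five times less when the calibration points are five times farther apart.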

Does the digitizer keep the PDF metadata?

No. The digitizer only sees pixels. Track the source PDF separately — for academic work, cite the original paper and figure number alongside the extracted data.

What if the PDF is protected (DRM/password)?

If you have legitimate access (you opened it in your reader and saw the figure), screenshot the chart and proceed. The digitizer doesn’t care that the source was protected.

How does this differ from extracting data from a chart image?

It doesn’t, after step 2. The PDF case adds one upstream step — render the page as PNG. Everything from “upload” onward is identical. The full four-step workflow is covered in our pillar guide on extracting data from graph images.