Data Extraction for Meta-Analysis: A Practical Guide
When authors don't share raw data, digitize from the published figure. A guide for systematic reviewers covering PRISMA/Cochrane requirements, Kaplan-Meier reconstruction, forest plots, and dose-response curves.
When primary study authors don’t share their raw data, digitize the published figures and reconstruct the underlying values. This is accepted practice under both the Cochrane Handbook and PRISMA 2020 reporting guidelines, provided the digitization method is reported and the digitized data is sanity-checked against any summary statistics in the text.
This guide is for systematic reviewers, meta-analysts, and HTA practitioners. It covers when digitization is appropriate, what the major guidance frameworks expect you to report, and three worked examples — Kaplan-Meier curve reconstruction, forest plot extraction, and dose-response digitization.
When digitization is appropriate
Digitize from a figure when (1) the analysis requires data the published text doesn’t tabulate, (2) the corresponding author hasn’t responded to a data request within a reasonable window (Cochrane suggests two reminders over four to six weeks), and (3) the figure itself is of sufficient quality to extract from with documentable accuracy.
If raw data is available in supplementary materials, the trial registry (ClinicalTrials.gov, ANZCTR), or an institutional repository — use it. Always check first. A clean CSV from the authors beats any digitization, however careful.
If only summary statistics (means, SDs, medians, IQRs) are needed and they appear in tables or text, transcribe them. Don’t digitize what’s already published as numbers.
Digitization is the right call when the question requires time-to-event data, dose-response curves, or distributional information that only exists in the figure. Most Kaplan-Meier reconstruction work falls into this bucket.
What PRISMA and Cochrane say
Both major frameworks accept digitized data. The reporting burden is non-trivial.
PRISMA 2020 (item 9: data collection process) requires you to describe the methods used to collect data from reports, including how many reviewers collected data, whether they worked independently, and any automation tools used. Digitization qualifies as a tool. Report the tool name, the version (if applicable), who did the extraction, and whether it was independently verified.
Cochrane Handbook (Chapter 5: Collecting data) is more specific. It accepts digitization for survival curves and other figure-only data, requires dual independent extraction with reconciliation, and recommends comparing extracted summary statistics against any in the text (e.g., median survival reported in the abstract). Significant discrepancies should be reported as a limitation.
MOOSE (for observational study meta-analyses) is broadly aligned with PRISMA on this point: report the method, report the tool, report the verification.
ISPOR’s good practice for indirect treatment comparisons explicitly accepts digitization of Kaplan-Meier curves using the Guyot et al. (2012) algorithm when individual patient data is unavailable. We discuss Guyot below.
The general workflow
The mechanics are the same as any other chart extraction — upload, place points, calibrate axes, export. The systematic-review-specific additions sit around those four steps.
- Decide what to extract based on the analysis plan. Don’t digitize speculatively; you’ll waste hours.
- Pre-register the digitization method in the protocol if the review is registered (PROSPERO, OSF).
- Have two reviewers extract independently for every included study.
- Reconcile discrepancies by re-extraction or discussion.
- Validate against summary statistics in the original report.
- Archive the extracted data alongside the source figure (the XLSX with the chart embedded is useful here; see our guide on chart screenshot to Excel).
For the four-step extraction mechanics that sit at the centre of this, our pillar guide on extracting data from graph images is the reference.
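The calibration step in that workflow reduces to a linear mapping from pixel positions to data values (or to log10 of the data values on a log axis). The sketch below is a minimal illustration of that idea, not the internals of any particular tool; the function name and the pixel positions are hypothetical.

```python
import math

def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Return a function mapping a pixel coordinate to a data value.

    (pixel_a, value_a) and (pixel_b, value_b) are the two calibration
    points placed on one axis. With log_scale=True the interpolation is
    done on log10(value), as needed for ratio or dose axes.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration points must sit at different pixels")
    if log_scale:
        if value_a <= 0 or value_b <= 0:
            raise ValueError("log axes need positive calibration values")
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        value = value_a + slope * (pixel - pixel_a)
        return 10 ** value if log_scale else value

    return to_value

# Hypothetical pixel positions for a 0-36 month X axis and a 0-1 Y axis.
x_map = make_axis_mapper(100, 0.0, 820, 36.0)
y_map = make_axis_mapper(600, 0.0, 80, 1.0)  # image Y grows downward
# Hypothetical log axis calibrated at 0.1 and 10.
ratio_map = make_axis_mapper(100, 0.1, 700, 10.0, log_scale=True)

print(x_map(460), y_map(340), ratio_map(400))
```

Recording the four calibration numbers per axis alongside the export is enough for a second reviewer to rebuild the same mapping.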
Worked example 1: Kaplan-Meier curve reconstruction
The most common use of digitization in systematic reviews is reconstructing individual patient data (IPD) from a published Kaplan-Meier curve. The standard method is from Guyot et al. (2012), implemented in R packages such as IPDfromKM.
The Guyot et al. method in one paragraph
Digitize the survival curve (click coordinates along the curve). Combine those coordinates with the numbers-at-risk table that almost every published KM curve includes below the X axis. Run the algorithm: it reconstructs the times and event indicators for each patient consistent with both the digitized curve and the numbers at risk. The output is an IPD-equivalent dataset you can use in Cox regression, flexible parametric models, or any other time-to-event analysis.
Worked walkthrough
Take a hypothetical phase 3 oncology trial figure. The KM curve runs from month 0 to month 36. Two arms: experimental (n=240) and control (n=238). Numbers at risk are reported at months 0, 6, 12, 18, 24, 30, 36.
Step 1: render the figure at 300 DPI. If the source is a PDF, see our PDF chart guide.
Step 2: upload to DataFromChart. Place points along the experimental arm at every visible step in the curve. KM curves are step functions — there’s a step at every event time, and you want to click at every step. For dense curves, use the color picker auto-extraction.
Step 3: calibrate. X axis: drag start to 0, end to 36 (months). Y axis: drag start to 0, end to 1.0 (survival probability). Label the axes “Time (months)” and “Survival probability” — these labels carry through to the XLSX export.
Step 4: export the digitized (time, survival) coordinates for the experimental arm. Repeat for control arm.
Step 5: feed both digitized curves plus the numbers-at-risk tables into the IPDfromKM package in R (preprocess() followed by getIPD()) or an equivalent implementation. The output is two reconstructed datasets, one per arm, each row a (time, event_indicator) pair.
Step 6: validate. Compute median survival from your reconstructed IPD and compare against the median reported in the trial’s text. If the trial’s abstract says median OS in the experimental arm was 18.5 months and your reconstruction says 16.2, you have a digitization problem (likely a Y-axis calibration error). Fix and re-run.
Reconstructed IPD lets you fit flexible parametric models, run a network meta-analysis on hazard ratios that vary over time, or extrapolate beyond the trial’s follow-up.
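The validation in Step 6 can be scripted once the reconstructed IPD is exported from R. The sketch below is a minimal Kaplan-Meier median in plain Python, assuming the IPD arrives as parallel lists of times and 0/1 event indicators. It takes the median as the earliest time at which the estimated survival is at or below 0.5; other software may handle a curve that sits exactly at 0.5 differently. The toy data and the reported median are hypothetical.

```python
from fractions import Fraction

def km_median(times, events):
    """Kaplan-Meier median: earliest time at which S(t) <= 0.5.

    times  -- follow-up time per patient
    events -- 1 if the event occurred at that time, 0 if censored
    Returns None when the curve never reaches 0.5.
    """
    at_risk = len(times)
    survival = Fraction(1)  # exact arithmetic avoids rounding at S(t) = 0.5
    for t in sorted(set(times)):
        at_t = [e for ti, e in zip(times, events) if ti == t]
        deaths = sum(at_t)
        if deaths:
            survival *= Fraction(at_risk - deaths, at_risk)
            if survival <= Fraction(1, 2):
                return t
        at_risk -= len(at_t)  # events and censored patients both leave the risk set
    return None

# Toy reconstructed arm (hypothetical values, in months).
times = [2, 4, 4, 6, 8, 10, 12, 14]
events = [1, 1, 0, 1, 1, 0, 1, 1]
reported_median = 8.5  # hypothetical value from the trial report

reconstructed = km_median(times, events)
print("reconstructed:", reconstructed, "gap:", abs(reconstructed - reported_median))
```

A gap larger than your pre-specified tolerance is the signal to revisit calibration before running any models.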
Common KM digitization pitfalls
Censoring tick marks. Some figures show censoring as tick marks on the curve. Don’t click them as data points — they’re not events, just censoring indicators. The Guyot algorithm doesn’t need you to mark them explicitly; it infers censoring from the numbers-at-risk discrepancies.
Curves that touch zero. A Kaplan-Meier curve reaches zero only when the last patient still at risk has an event, which is uncommon when follow-up ends with patients censored. Check the numbers at risk: if N_at_risk is non-zero at the final timepoint, the curve should not be at zero there.
Wrong reference point at t=0. Confirm the curve starts at S(0) = 1.0 and X = 0. If the published figure starts the X axis at a later time (a “landmark” analysis), report that in your methods.
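Several of these pitfalls can be caught automatically before the points go into the reconstruction. The following is a sketch of such a check, assuming the digitized curve is a list of (time, survival) tuples; the function name and the tolerance defaults are hypothetical and should be tuned to your figure resolution.

```python
def check_km_points(points, final_at_risk=None, tol=0.005, time_tol=0.1):
    """Return a list of warnings for digitized (time, survival) points.

    points        -- list of (time, survival) tuples
    final_at_risk -- number at risk at the last reported timepoint, if known
    tol, time_tol -- slack allowed for click jitter (hypothetical defaults)
    """
    if not points:
        return ["no points supplied"]
    problems = []
    # Order by time; at a vertical step, keep the higher survival first.
    pts = sorted(points, key=lambda p: (p[0], -p[1]))
    t0, s0 = pts[0]
    if abs(t0) > time_tol or abs(s0 - 1.0) > tol:
        problems.append(f"curve starts at ({t0}, {s0}), expected (0, 1.0)")
    for (_, s_prev), (t, s) in zip(pts, pts[1:]):
        if s > s_prev + tol:
            problems.append(f"survival rises at t={t}")
    if any(s < -tol or s > 1.0 + tol for _, s in pts):
        problems.append("survival outside [0, 1]")
    if final_at_risk and pts[-1][1] <= tol:
        problems.append("curve reaches zero but patients remain at risk")
    return problems

clean = [(0, 1.0), (3, 0.9), (3, 0.85), (9, 0.6), (36, 0.2)]
rising = [(0, 1.0), (6, 0.7), (12, 0.8)]
print(check_km_points(clean))
print(check_km_points(rising))
```

An empty list does not prove the extraction is right; it only rules out the mechanical errors listed above.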
Digitizing KM curves for a review? The extractor supports the color-picker workflow that handles dense step curves well. Export to XLSX so the chart sits next to the data for dual-reviewer verification.
Worked example 2: forest plot extraction
Forest plots are easier than KM curves because they encode discrete values (point estimates and CI bounds), not continuous data. But the encoding is precise: a 0.5 mm shift in a CI bar translates into a real difference in your meta-analysis weights.
Walkthrough
Take a hypothetical forest plot from a published meta-analysis showing 12 trials of an intervention versus control, risk ratio scale.
Step 1: render the figure at 300 DPI.
Step 2: upload. Each trial row contains a point (the point estimate, drawn as a square) and a horizontal line (the 95% CI). You want three values per trial: lower CI, point estimate, upper CI.
Step 3: place three points per trial — one at each end of the CI line, one at the point estimate square’s center. Group as “trial_X” for each.
Step 4: calibrate the X axis. Forest plots often use a log scale for ratio measures. Calibrate at two visible powers of ten (e.g., 0.1 and 10). The Y axis is categorical (trial names) so no calibration is needed there — you’ll use the trial labels in your dataset manually.
Step 5: export. You get three rows per trial. In a spreadsheet, pivot so each trial is one row with columns lower, estimate, upper. Compute the standard error of the log risk ratio per trial from (log(upper) - log(lower)) / (2 * 1.96), using natural logs. You now have the inputs for a fresh meta-analysis or a sensitivity analysis on the published one.
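The pivot and the standard-error calculation in Step 5 can also be done in a few lines of code rather than a spreadsheet. This sketch assumes the export has been read into (trial_label, x_value) pairs on the ratio scale; the function name, field names, and trial values are hypothetical. It also reports how far the log point estimate sits from the midpoint of the log CI, which should be close to zero for a Wald-type interval and is a cheap check on click accuracy.

```python
import math
from collections import defaultdict

def trial_rows(points):
    """Collapse three digitized X values per trial into one row each.

    points -- iterable of (trial_label, x_value) on the ratio scale.
    The smallest value is taken as the lower CI bound, the largest as
    the upper bound, and the middle one as the point estimate.
    """
    grouped = defaultdict(list)
    for label, x in points:
        grouped[label].append(x)
    rows = {}
    for label, xs in grouped.items():
        if len(xs) != 3:
            raise ValueError(f"{label}: expected 3 points, got {len(xs)}")
        lower, estimate, upper = sorted(xs)
        log_lower, log_upper = math.log(lower), math.log(upper)
        rows[label] = {
            "lower": lower,
            "estimate": estimate,
            "upper": upper,
            "log_estimate": math.log(estimate),
            # SE of the log ratio from a 95% CI
            "se_log": (log_upper - log_lower) / (2 * 1.96),
            # Distance of the estimate from the CI midpoint on the log scale
            "asymmetry": math.log(estimate) - (log_lower + log_upper) / 2,
        }
    return rows

digitized = [
    ("trial_1", 0.80), ("trial_1", 0.62), ("trial_1", 1.03),
    ("trial_2", 0.45), ("trial_2", 0.70), ("trial_2", 1.09),
]
for label, row in trial_rows(digitized).items():
    print(label, round(row["se_log"], 4), round(row["asymmetry"], 4))
```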
Common forest plot pitfalls
Diamond at the bottom. The summary diamond at the bottom of a forest plot is the meta-analytic estimate, not a trial. Skip it during extraction.
Sub-group lines. Sub-group summary rows look like trial rows but represent pooled estimates. Skip them or extract them separately and label clearly.
Square size encodes weight. The size of the point estimate square is proportional to the trial’s weight in the meta-analysis. You don’t need to extract the size — extract the center and let your re-analysis compute weights from the SEs.
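To illustrate how weights fall out of the extracted standard errors, here is a minimal inverse-variance (fixed-effect) pooling sketch on the log scale. It is an assumption that a fixed-effect model is appropriate; if the published analysis used a random-effects model, its weights and summary diamond will differ from what this produces. The function name and input values are hypothetical.

```python
import math

def pool_fixed_effect(log_estimates, se_logs):
    """Inverse-variance fixed-effect pooling of log ratio estimates."""
    weights = [1.0 / se ** 2 for se in se_logs]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return {
        "weights_pct": [100.0 * w / total for w in weights],
        "pooled_ratio": math.exp(pooled),
        "ci": (math.exp(pooled - 1.96 * pooled_se),
               math.exp(pooled + 1.96 * pooled_se)),
    }

# Hypothetical inputs: log risk ratios and their SEs for three trials.
result = pool_fixed_effect(
    [math.log(0.80), math.log(0.70), math.log(0.95)],
    [0.13, 0.23, 0.10],
)
print([round(w, 1) for w in result["weights_pct"]], round(result["pooled_ratio"], 3))
```

Comparing the recomputed weights against the relative square sizes in the figure is a quick visual check that no trial row was misread.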
Worked example 3: dose-response curve
Dose-response work — common in toxicology, pharmacology, environmental epidemiology — typically requires both the curve shape and confidence bounds.
Walkthrough
Take a hypothetical published dose-response curve relating dose (mg/kg, log scale, 0.1–100) to response (% effect, 0–100%). The figure shows a sigmoid fit plus a 95% confidence band.
Step 1: render the figure at 300 DPI.
Step 2: upload. There are three things to extract: the central curve, the upper bound of the confidence band, and the lower bound.
Step 3: place points. For the central curve, click at evenly-spaced X positions (say every quarter-log: 0.1, 0.18, 0.32, 0.56, 1.0, …). For the bands, use the color picker — confidence bands are typically a lighter shade of the curve color, and the picker handles this cleanly with a moderate tolerance setting.
Step 4: calibrate. X axis is log: calibrate at 0.1 and 100, set the axis type to logarithmic. Y axis is linear: calibrate at 0 and 100. Set axis labels “Dose (mg/kg)” and “Response (%)” — those propagate to the XLSX export.
Step 5: export. You now have three (dose, response) series: central, upper, lower. Compute SE per dose as (upper - lower) / (2 * 1.96), which assumes the band is a symmetric 95% interval on the response scale.
Step 6: validate. The original paper likely reports an ED50 (the dose at 50% response). Compute it from your digitized curve and compare. If they match within a few percent, your extraction is good.
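The ED50 check in Step 6 can be approximated directly from the digitized points without refitting the sigmoid. The sketch below interpolates linearly in log10(dose) between the two points that bracket 50% response; that interpolation is an assumption made for simplicity, and refitting the published model would be the more rigorous check. The function names and data values are hypothetical.

```python
import math

def ed50_from_points(doses, responses, target=50.0):
    """Interpolate the dose at `target` response, linearly in log10(dose).

    Returns None if the digitized curve never crosses the target.
    """
    pairs = sorted(zip(doses, responses))
    for (d0, r0), (d1, r1) in zip(pairs, pairs[1:]):
        crosses = (r0 - target) * (r1 - target) <= 0
        if crosses and r0 != r1:
            frac = (target - r0) / (r1 - r0)
            log_d = math.log10(d0) + frac * (math.log10(d1) - math.log10(d0))
            return 10 ** log_d
    return None

def se_from_band(lower, upper):
    """SE at one dose, assuming a symmetric 95% band on the response scale."""
    return (upper - lower) / (2 * 1.96)

# Hypothetical digitized central curve (dose in mg/kg, response in %).
doses = [0.1, 0.32, 1.0, 3.2, 10.0]
responses = [2.0, 10.0, 30.0, 70.0, 95.0]

ed50 = ed50_from_points(doses, responses)
print(round(ed50, 3), round(se_from_band(62.0, 78.0), 3))
```

Denser clicks around the 50% region reduce the interpolation error, since the sigmoid is steepest there.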
Common dose-response pitfalls
Linear axis mistaken for log. A dose-response plot drawn on a linear X axis but with log-spaced ticks is rare but does happen in older papers. Confirm the axis type from the figure caption before calibrating.
Truncated Y axis. Some figures show Y from 20% to 80% to “zoom in” on the active region. Calibrate at the tick values actually shown rather than at 0 and 100, and note the truncation in your methods; it matters for downstream modeling because the plateaus of the curve are not visible.
Reporting standards for your methods section
A minimal reporting block for a systematic review using digitization looks like this:
Where individual patient data was unavailable, time-to-event outcomes were reconstructed from published Kaplan-Meier curves using the algorithm of Guyot et al. (2012). Curves were digitized in DataFromChart (version 1.x), with two reviewers (initials, initials) extracting independently; discrepancies above 2% in any (time, survival) pair were reconciled by re-extraction and consensus. Reconstructed median survival was validated against text-reported medians; agreement was within ±0.5 months across all included studies.
Adjust for your tool, your reviewers, your tolerance threshold, and your validation outcome. The structure (method, tool, dual extraction, reconciliation, validation) is what reviewers expect to see.
For PRISMA reporting, item 9 of the 2020 checklist (data collection process) is where this block lives. For Cochrane reviews, it sits in the “Data extraction and management” methods section. Cite the algorithm paper (Guyot et al. 2012 for KM, the equivalent for whatever you used); methods reviewers know these by name.
Reproducibility tips
Digitization is reproducible only if someone else can run the same workflow and get the same numbers. Four practices make this happen.
Archive the source figure alongside the data. XLSX with the chart embedded is convenient — the figure and the digitized data sit in the same file. CSV-only workflows need a separate archived PNG.
Record the calibration values used. If you calibrated X at 0 and 36 months, write that down. The Cochrane Handbook expects this level of detail in a sensitivity analysis context.
Use a fixed tool version and report it. If you re-run an extraction six months later and the tool’s calibration algorithm changed, your data may shift. Pin a version in your methods.
Run dual independent extraction. Two reviewers using the same tool on the same figure should produce results within a few percent. If they diverge by 5%+, the figure is ambiguous and the extraction itself is a limitation.
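The dual-extraction comparison is easy to automate once both reviewers have exported their values. The sketch below assumes each extraction is a dict keyed by a shared label (for example trial and bound); the function name, labels, and the 5% default threshold are hypothetical and should match whatever tolerance your protocol pre-specifies.

```python
def compare_extractions(reviewer_a, reviewer_b, threshold=0.05):
    """Flag values where two reviewers differ by more than `threshold`.

    reviewer_a, reviewer_b -- dicts mapping a label to an extracted value
    threshold -- maximum tolerated relative difference (0.05 = 5%)
    Returns a dict of label -> reason for every flagged item.
    """
    flagged = {}
    for label in sorted(set(reviewer_a) | set(reviewer_b)):
        if label not in reviewer_a or label not in reviewer_b:
            flagged[label] = "missing from one extraction"
            continue
        a, b = reviewer_a[label], reviewer_b[label]
        scale = max(abs(a), abs(b))
        rel = abs(a - b) / scale if scale else 0.0
        if rel > threshold:
            flagged[label] = f"relative difference {rel:.1%}"
    return flagged

# Hypothetical extractions of the same forest plot by two reviewers.
first = {"trial_1_lower": 0.62, "trial_1_upper": 1.03, "trial_2_lower": 0.45}
second = {"trial_1_lower": 0.63, "trial_1_upper": 1.12, "trial_2_lower": 0.45}

print(compare_extractions(first, second))
```

Flagged items go back for re-extraction or discussion; the list of flags, before and after reconciliation, is worth archiving with the data.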
For a deeper dive into the four-step extraction mechanics that sit at the heart of all this, see our guide on extracting data from a graph image.
Tool choice for meta-analysis work
Two tools dominate the systematic review world: WebPlotDigitizer and DataFromChart. Both produce comparable accuracy on clean source images. The choice comes down to workflow.
WebPlotDigitizer is the name most methods reviewers expect to see. It’s been the field standard since the early 2010s and has been cited in thousands of methodology papers. Use it when reviewer familiarity is the dominant factor.
DataFromChart produces XLSX output with the chart image and axis labels embedded, which simplifies dual-reviewer comparison and archiving. The color-picker auto-extraction handles dense KM curves faster than manual clicking. Use it when reproducibility and reviewer-to-reviewer comparison matter more than tool-name recognition.
The full landscape — including five other tools — is in our WebPlotDigitizer alternatives roundup.
If you’re partway through a systematic review with figure-only data, the extractor covers the digitization step end-to-end and produces XLSX output suitable for dual-reviewer archiving. Open one of your included studies’ figures and try the four-step workflow.
FAQ
Is digitized data acceptable to peer reviewers?
Yes, when the method is reported transparently and the digitized data is validated against any summary statistics in the source. Both Cochrane and PRISMA explicitly accept it; ISPOR’s good-practice document for indirect comparisons explicitly endorses Kaplan-Meier digitization.
How accurate is reconstructed IPD from Kaplan-Meier curves?
When done carefully (Guyot algorithm, dual extraction, numbers-at-risk validation), reconstructed IPD reproduces median survival within 0.5 months and hazard ratios within 5% of the true IPD in most validation studies. Accuracy degrades with low-resolution figures, missing numbers-at-risk tables, or heavy censoring late in follow-up.
Do I need dual extraction?
Cochrane requires it. PRISMA 2020 asks you to report how many reviewers extracted data and whether they worked independently. For non-Cochrane reviews, single extraction is acceptable if you validate against text-reported summary statistics, but dual extraction is best practice and not much more work.
Which tool should I cite in the methods?
Cite the digitization tool by name and version (e.g., “DataFromChart v1.x” or “WebPlotDigitizer v4.x”). For Kaplan-Meier reconstruction, also cite the algorithm paper (Guyot et al. 2012, BMC Medical Research Methodology).
What if the corresponding author has agreed to share data but hasn’t sent it yet?
Document the request and the response timeline in your protocol. If the data arrives, use it. If it doesn’t arrive within your pre-specified window, proceed with digitization and note the unsuccessful data request in your limitations.
Can I use digitized data for a network meta-analysis?
Yes. Digitized KM curves reconstructed via Guyot are routinely used in NMAs of time-to-event outcomes, especially in oncology HTA work. ISPOR’s NMA good practice document endorses the approach.
How do I handle figures with overlapping confidence bands?
Extract each band separately using the color picker. Where bands overlap, the pixel color is the blend of the two — set your tolerance carefully so the picker captures the blend region as one or the other, then visually inspect for misallocation.
What about Bayesian meta-analyses?
The digitization step is identical. Bayesian methods (e.g., flexible parametric survival models with informative priors) consume the same reconstructed IPD as frequentist methods do. The digitization just gets you to the IPD.
Is there a difference between extracting from journal PDFs versus images on a website?
Mechanically, no — both reduce to “render or screenshot, then extract.” Journal PDFs are usually higher resolution. Web-hosted figures are often optimized for screen size and lose detail. Always prefer the PDF source.
Where does my extracted data live after the review is published?
Archive the XLSX (with embedded chart) plus the calibration values used, ideally in a data repository (Dryad, Zenodo, OSF). PRISMA 2020 item 27 (data, code, materials availability) expects this disclosure.