Abstract
We appreciate Krefeld-Schwalb et al.'s (KHJ henceforth) (1) interest in our study (2) and the critical discourse spurred by their commentary. We largely agree with KHJ's theoretical arguments, which closely relate to the caveats discussed in our manuscript. In particular, we acknowledge that the reviewed multilab studies are typically based on samples from WEIRD countries, which may entail lower heterogeneity than in other settings (3, 4). "Put differently, our comparatively low estimates of population heterogeneity might be subject to population heterogeneity itself" (2).

However, we express reservations about KHJ's empirical claims about the magnitude of population heterogeneity, which draw on Krefeld-Schwalb et al. (KSJ henceforth) (5). It seems that KSJ intentionally studied paradigms expected to yield large meta-analytic effect sizes and "employed purposive variation of the sampling frame" (1) to enhance heterogeneity. Olsson-Collentine et al. (6) provide evidence for a correlation between effect sizes and heterogeneity.

Study 1 in KSJ documents effect size estimates of four paradigms across ten online samples and one laboratory sample. KHJ's estimates of H, ranging from 1.7 to 9.6, suggest that population heterogeneity is markedly larger than the average level observed in our sample. However, both KSJ and KHJ fail to report estimates for a fifth preregistered paradigm embedded in KSJ's study—the "local warming" effect—and omit preregistered analyses excluding inattentive participants. Table 1 summarizes heterogeneity estimates for analyses mimicking KSJ's preregistration. Revisiting KSJ's data on the local warming effect indicates that effect size estimates are homogeneous, with H = 1. Moreover, population heterogeneity estimates for the five paradigms turn out to be lower after excluding inattentive participants.
KHJ also report heterogeneity estimates for Studies 2 and 3 in KSJ, which are, however, based on only two samples each. Quantifying heterogeneity from very small numbers of studies (k) has been shown to be inappropriate and can be misleading; for k = 2, H is uninformative, since H² = Q ÷ (k − 1) = Q (7). In sum, heterogeneity in KSJ appears to be much smaller than suggested by KHJ, with H estimates ranging from 1.0 to 3.2 when including inattentive participants and from 1.0 to 1.6 when excluding them.

How heterogeneous are the populations in KSJ? All online samples in Study 1 were drawn from "anglophone participants in three highly developed countries" (1) and turned out to be relatively similar in terms of demographic characteristics. When the crowdsourced marketplace is held constant across three samples (Prolific, Prolific US, and Prolific UK), effect size estimates are remarkably similar. This suggests that collecting data via different online platforms may exacerbate heterogeneity due to, for instance, differing procedures for screening and compensating participants. These differences, in turn, may introduce variability in the number of bots, attrition rates, attention, experience, and comprehension: moderating factors for which KSJ provides supporting evidence. Consequently, the generalizability of empirical claims across and beyond varying online marketplaces may be lower than for studies based on laboratory or observational data. The extent of population heterogeneity is ultimately an empirical question that requires further investigation and evidence.
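To illustrate why H is uninformative at k = 2, a minimal sketch using the standard Higgins–Thompson definitions may help; the effect sizes and sampling variances below are purely hypothetical and do not come from KSJ's data.

```python
import math

def cochran_Q(effects, variances):
    """Cochran's Q: inverse-variance-weighted squared deviations
    from the fixed-effect mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)
    return sum(wi * (ei - mean) ** 2 for wi, ei in zip(w, effects))

def H_statistic(effects, variances):
    """Higgins-Thompson H = sqrt(Q / (k - 1))."""
    k = len(effects)
    return math.sqrt(cochran_Q(effects, variances) / (k - 1))

# Hypothetical example with k = 2 samples: the degrees of freedom
# equal 1, so H^2 collapses to Q itself.
effects = [0.30, 0.10]    # hypothetical standardized effect sizes
variances = [0.02, 0.02]  # hypothetical sampling variances
Q = cochran_Q(effects, variances)
H = H_statistic(effects, variances)
# With k = 2, H**2 equals Q exactly, so H carries no information
# beyond the single between-sample contrast.
```

Because a single contrast drives both Q and H when k = 2, the statistic cannot separate genuine between-population variation from sampling noise, which is the sense in which it is uninformative here.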