NEUROFIT Real-World Outcomes
A Longitudinal Observational Analysis of Self-Reported Stress (NF8) Across 16,487 App Users
Scope of this document
This whitepaper reports real-world observational outcomes from consumer use of the NEUROFIT app. It is not a randomized controlled trial and does not establish that NEUROFIT causes the changes reported below. The analysis describes how self-reported stress changed over time among people who chose to use the app, and presents the steps we took to test the most common non-causal explanations for that change. Where we compare our figures to published therapy benchmarks, those comparisons are illustrative context, not equivalence claims — the populations, instruments, constructs, and timeframes differ.
NEUROFIT is a low-risk wellness product. Nothing here is a medical diagnosis or a treatment claim. A full limitations section appears near the end of this document, and we encourage readers to weigh the findings against it.
Summary of findings
Across 16,487 users in 100+ countries over 32 months, self-reported stress on NEUROFIT's 8-domain somatic scale (the NF8) declined rapidly and the decline was sustained:
- Day 7: 38.7% reduction in mean NF8 score (Cohen's d = 0.80; 95% CI [0.67, 0.93]).
- Day 35: 52.1% reduction (d = 0.86; 95% CI [0.62, 1.10]).
- Among higher-baseline-stress users (NF8 ≥ 5/16), effects were larger (Day 7 d = 0.96).
- A prespecified responder threshold (≥ 30% NF8 reduction) was met by roughly 60% of checking-in users by Day 6–14.
We then tested whether these results could be artifacts of (a) regression to the mean, (b) response-driven dropout, or (c) selection of paying users. Each test is reported below. The findings reduce — but, in the absence of a randomized control arm, do not eliminate — non-intervention explanations.
Dataset and methods
Population. All users with a baseline NF8 score ≥ 1/16 (i.e. any reported stress) at their first in-app check-in. Primary population N = 16,487; a higher-stress subgroup (baseline ≥ 5/16) N = 11,766.
Outcome. The NF8, a real-time self-report of current nervous-system state across eight life domains, each scored 0–2 (total range 0–16). Construction and validation are detailed below. Baseline is the user's first check-in; users complete at most one check-in per day.
Analysis. Mixed-Model Repeated Measures (MMRM) with AR(1) and compound-symmetry covariance structures, under an intent-to-treat framework: every user meeting the baseline criterion with at least one post-baseline check-in is included, regardless of subsequent engagement. MMRM uses all available observations and yields valid estimates under a Missing-At-Random (MAR) assumption; the evidence bearing on that assumption is reported in Robustness, below.
Analyst. The analysis was conducted by Lisa Kang, MSPH, MS-CSO (Harvard Medical School), a biostatistician with 15+ years of pharmaceutical-industry experience (Pfizer, GSK, Genzyme). She was engaged as an independent statistical advisor for this analysis and subsequently retained as a clinical advisor. Her peer-reviewed clinical-trials work is published under the name Lih Lisa Kang / Kang LL. Reproduction by a separate, unaffiliated party is a planned next step (see Limitations).
A note on engagement: because NEUROFIT is designed as a short-window, rapid-relief intervention, declining check-in frequency over time is an expected feature of the usage model. Retention figures are reported transparently alongside each timepoint so readers can see the N underlying every estimate.
Primary outcomes
Primary population — baseline NF8 ≥ 1/16 (N = 16,487)
| Timepoint | N | NF8 reduction | Cohen's d | 95% CI | Retention |
|---|---|---|---|---|---|
| Day 2 | 10,202 | 23.5% | 0.71 | [0.61, 0.80] | 61.9% |
| Day 6 | 4,602 | 37.0% | 0.81 | [0.68, 0.94] | 27.9% |
| Day 7 | 4,158 | 38.7% | 0.80 | [0.67, 0.93] | 25.2% |
| Day 14 | 2,481 | 44.2% | 0.85 | [0.69, 1.00] | 15.0% |
| Day 35 | 969 | 52.1% | 0.86 | [0.62, 1.10] | 5.9% |
All reductions p < 0.0001. A full-sample sensitivity analysis (N = 17,080, including baseline-zero users) yielded near-identical reductions (23% / 38% / 52%) with d = 0.72 at Day 7, consistent with expected dilution from floor effects.
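The Retention column in the table above is each timepoint's N divided by the baseline N. The sketch below recomputes it from the published counts; the function and variable names are illustrative and not part of the original analysis.

```python
# Recompute the Retention column of the primary-population table.
# Counts are copied from the table; names are illustrative only.
BASELINE_N = 16_487

n_by_day = {2: 10_202, 6: 4_602, 7: 4_158, 14: 2_481, 35: 969}


def retention_pct(n_at_timepoint: int, n_at_baseline: int) -> float:
    """Share of baseline users with a check-in at this timepoint, in percent."""
    return round(100 * n_at_timepoint / n_at_baseline, 1)


retention = {day: retention_pct(n, BASELINE_N) for day, n in n_by_day.items()}

for day, pct in retention.items():
    print(f"Day {day}: {pct}%")
```

The same function applies to the higher-stress subgroup if its own baseline N is used as the denominator.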
Higher-stress subgroup — baseline NF8 ≥ 5/16 (N = 11,766)
| Timepoint | N | NF8 reduction | Cohen's d | 95% CI | Retention |
|---|---|---|---|---|---|
| Day 2 | 5,188 | 27.1% | 0.88 | [0.77, 0.99] | 44.1% |
| Day 7 | 2,078 | 41.9% | 0.96 | [0.81, 1.11] | 17.7% |
| Day 14 | 1,200 | 46.6% | 1.03 | [0.85, 1.22] | 10.2% |
| Day 35 | 457 | 52.4% | 1.10 | [0.82, 1.38] | 3.9% |
This subgroup is more strongly selected on baseline severity, which makes it the appropriate place to test whether the effect survives a meaningful regression-to-the-mean correction (the primary population's selection is too slight to provide that test). That analysis is reported in Robustness, below; the short version is that after correcting for RTM, a large residual effect remains.
Responder analysis (≥ 30% NF8 reduction, prespecified threshold)
| Population | Day 6 | Day 7 | Day 14 |
|---|---|---|---|
| All check-in users | 60% | 59% | 60% |
| Baseline ≥ 1/16 | 62% | 61% | 62% |
| Baseline ≥ 5/16 | 65% | 63% | 65% |
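As an illustration of the responder rule, the sketch below classifies a user as a responder when their NF8 score has fallen by at least 30% from baseline. The score pairs are hypothetical, and the handling of baseline-zero users (where a percentage reduction is undefined) is an assumption, since the document does not say how those users are treated in the "All check-in users" row.

```python
# Responder rule: at least a 30% reduction from baseline NF8.
# Integer arithmetic avoids floating-point edge cases at exactly 30%.
THRESHOLD_PCT = 30


def is_responder(baseline: int, followup: int, threshold_pct: int = THRESHOLD_PCT) -> bool:
    """True if the score fell by at least threshold_pct percent of baseline."""
    if baseline <= 0:
        # Assumption: percent reduction is undefined for a zero baseline.
        raise ValueError("percent reduction is undefined for baseline <= 0")
    return (baseline - followup) * 100 >= threshold_pct * baseline


# Hypothetical (baseline, follow-up) NF8 pairs, for illustration only.
pairs = [(8, 4), (6, 5), (10, 7), (5, 5), (12, 6)]

responder_rate = sum(is_responder(b, f) for b, f in pairs) / len(pairs)
print(f"Responder rate: {responder_rate:.0%}")
```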
Robustness: testing non-causal explanations
These analyses are the core of the document. None substitutes for a randomized control group, but together they constrain the leading alternative explanations.
Regression to the mean (RTM). RTM is expected in any baseline-selected longitudinal design, and its magnitude scales with how strongly the population is selected. Using the Barnett/Linden estimation framework (Barnett et al., 2005; Linden, 2013):
- Primary population (NF8 ≥ 1/16). With observed test-retest reliability (ICC = 0.40) and an inclusion threshold that excludes only 3.5% of check-in users, the estimated RTM contribution is d ≈ 0.05 — negligible against the observed d = 0.80. (Note that this near-absence of RTM is a direct consequence of the threshold being so permissive; it shows the primary effect is not an RTM artifact, but it cannot demonstrate that the effect survives a substantial correction, because there is almost nothing to correct.)
- Higher-stress subgroup (NF8 ≥ 5/16). Here the threshold excludes 31% of users, so RTM is materially larger: estimated d ≈ 0.31. Subtracting this from the observed Day-7 effect leaves a residual d ≈ 0.65 — still a large effect after a meaningful correction. This is reported not because the corrected number is bigger, but because it is the stronger test: the effect persists once a non-trivial RTM component is removed.
We report the residual explicitly rather than presenting the uncorrected subgroup effect as a headline.
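For readers who want to recompute these estimates, the sketch below uses a standard truncated-normal form of the RTM effect in the spirit of Barnett et al. (2005): in baseline-SD units, RTM ≈ (1 − reliability) × φ(z) / (1 − Φ(z)), where z is the standardized cutoff. It assumes approximately normal baseline scores and derives z from the excluded fraction. Whether the original analysis used exactly this form is an assumption, although with the inputs stated above it lands on the same rounded figures.

```python
from statistics import NormalDist


def rtm_effect_sd_units(excluded_fraction: float, reliability: float) -> float:
    """Expected RTM (in baseline-SD units) when selecting scores above a cutoff.

    Assumes normally distributed baseline scores, with the cutoff placed so
    that `excluded_fraction` of the population falls below it.
    """
    std_normal = NormalDist()
    z_cut = std_normal.inv_cdf(excluded_fraction)  # standardized cutoff
    # Mean standardized distance of the selected group above the population mean
    selection_term = std_normal.pdf(z_cut) / (1.0 - excluded_fraction)
    # Only the unreliable share of the variance regresses
    return (1.0 - reliability) * selection_term


ICC = 0.40  # test-retest reliability stated in the text

rtm_primary = rtm_effect_sd_units(excluded_fraction=0.035, reliability=ICC)
rtm_subgroup = rtm_effect_sd_units(excluded_fraction=0.31, reliability=ICC)
residual_subgroup = 0.96 - rtm_subgroup  # observed Day-7 d minus estimated RTM

print(f"Primary RTM d   ~ {rtm_primary:.2f}")
print(f"Subgroup RTM d  ~ {rtm_subgroup:.2f}")
print(f"Subgroup residual d ~ {residual_subgroup:.2f}")
```

Because the estimate scales with (1 − reliability), a higher reliability input shrinks the RTM term proportionally, so the choice of ICC matters when recomputing.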
Response-driven dropout (the MAR question). If users who weren't improving were the ones who stopped checking in, effect sizes among completers would be inflated. Three sensitivity analyses are inconsistent with that pattern: (1) baseline severity does not meaningfully predict dropout (d = 0.05–0.08, trivial); (2) early response (Days 0–6) does not predict subsequent dropout (d = 0.04, p = 0.21); (3) early response does not predict paid conversion (d = 0.04, p = 0.37; responders and non-responders convert at statistically indistinguishable rates, 86.1% vs 86.5%, p = 0.78). These results are consistent with the MAR assumption and suggest that dropout is engagement-driven rather than response-driven. (MAR cannot be proven directly, because it concerns unobserved values; see Limitations.)
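The dropout checks above are reported as standardized mean differences. The sketch below shows one common way to compute such a difference (Cohen's d with a pooled standard deviation) between users who kept checking in and users who stopped. The data are hypothetical, and the exact estimator used in the original analysis is not stated, so this is only an illustration of the comparison's logic.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference between two groups, using a pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical early response (fractional NF8 reduction over Days 0-6).
stayed = [0.40, 0.10, 0.55, 0.30, 0.20, 0.45]
dropped = [0.35, 0.15, 0.50, 0.25, 0.30, 0.40]

d_early_response = cohens_d(stayed, dropped)
print(f"Early response, stayed vs dropped: d = {d_early_response:.2f}")
```

A d near zero, as in the reported analyses, means early improvement looks similar in both groups, which is what one would expect if dropout were not driven by response.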
Selection of paying users (a natural experiment). Because the primary population is self-paying, one could argue the effect reflects motivated, self-selected buyers. A cohort of N = 284 users who received gifted free access from clinicians, which removes the cost barrier and changes the selection pathway, provides a check. These users were retained at 50.4% at Day 7 (vs 15.7% for self-pay, a 3.2× difference), yet their effect size was essentially identical to that of self-pay users (gifted d = 0.71 vs self-pay d = 0.71 at Day 7; Δd = +0.006). This suggests that paid-user selection does not manufacture the observed effect, and that reach is constrained more by cost than by efficacy. It does not introduce a no-intervention comparison group (all users in both cohorts used the app) and so does not by itself establish causation.
The NF8 instrument
What it measures. The NF8 captures a user's current self-reported nervous-system state, in real time, across eight life domains (current state, relationships, finances, family/friends, career, health, purpose, environment). Each domain is scored 0–2 by mapping the user's selected state to a dysregulation weight (regulated states = 0; fight/flight or shutdown = 1; combined overwhelm = 2), summed to a 0–16 total.
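The scoring rule described above can be sketched as a lookup followed by a sum. The state labels and domain keys below are hypothetical identifiers chosen for illustration; only the weights (0, 1, 1, 2), the eight domains, and the 0–16 total come from the text.

```python
# Dysregulation weights per selected state (labels are illustrative).
STATE_WEIGHTS = {
    "regulated": 0,
    "fight_flight": 1,
    "shutdown": 1,
    "overwhelm": 2,
}

# The eight life domains named in the text (keys are illustrative).
DOMAINS = (
    "current_state", "relationships", "finances", "family_friends",
    "career", "health", "purpose", "environment",
)


def nf8_total(selected_states: dict[str, str]) -> int:
    """Sum the per-domain weights into a 0-16 total."""
    missing = set(DOMAINS) - set(selected_states)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    return sum(STATE_WEIGHTS[selected_states[domain]] for domain in DOMAINS)


# Example: mostly regulated, with stress in two domains.
check_in = {domain: "regulated" for domain in DOMAINS}
check_in["finances"] = "fight_flight"
check_in["career"] = "overwhelm"
print(nf8_total(check_in))  # 0 + 1 + 2 = 3
```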
Why a new instrument. Established scales such as the PHQ-9 and GAD-7 use 2–4 week recall windows and therefore cannot resolve change within a 2–7 day window. The NF8's real-time design is what enables detection of short-horizon change; this is also why its 24-hour test-retest reliability is moderate by design (ICC(2,1) = 0.52–0.57 during active intervention) — high test-retest stability during a period of genuine change would indicate insensitivity to that change, not measurement quality.
Psychometric properties.
| Comparison | N | Pearson r | 95% CI |
|---|---|---|---|
| DASS-21 (Stress subscale) | 120 | 0.52 | [0.37, 0.64] |
| GAD-7 | 159 | 0.57 | [0.45, 0.67] |
| PHQ-9 | 128 | 0.63 | [0.51, 0.73] |
| PHQ-9 (dorsal-emphasized scoring) | 128 | 0.72 | [0.62, 0.79] |
All p < 0.0001. Internal consistency: Cronbach's α = 0.80 (95% CI 0.77–0.83, n = 407). Convergent-validity samples are modest in size and drawn from the target market rather than clinical settings (see Limitations). Biomarker convergence is reported separately below.
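The confidence intervals in the table above can be approximated from r and N alone with the Fisher z-transformation, and squaring r gives the share of variance two measures have in common. The sketch below assumes the standard Fisher approximation; the original analysis may have used a different interval method, so small differences in the second decimal are expected.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def pearson_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate CI for a Pearson correlation via the Fisher z-transformation."""
    z = atanh(r)
    se = 1.0 / sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return tanh(z - crit * se), tanh(z + crit * se)


def shared_variance_pct(r: float) -> float:
    """Percent of variance shared by two measures correlated at r."""
    return 100 * r ** 2


# (label, N, r) copied from the convergent-validity table.
rows = [
    ("DASS-21 Stress", 120, 0.52),
    ("GAD-7", 159, 0.57),
    ("PHQ-9", 128, 0.63),
    ("PHQ-9 (dorsal-emphasized)", 128, 0.72),
]

for label, n, r in rows:
    lo, hi = pearson_ci(r, n)
    print(f"{label}: r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], "
          f"shared variance {shared_variance_pct(r):.0f}%")
```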
Theoretical framing. The NF8's state structure is consistent with polyvagal theory (Porges, 1995; 2011) but does not depend on it. Polyvagal theory's neurophysiological premises are the subject of active scientific debate; the NF8's claim to validity rests on its empirical psychometric performance (convergent correlations, internal consistency, biomarker alignment), not on the correctness of any single theoretical account. We describe the states descriptively for that reason.
PTSD relevance (planned, not yet validated). Direct validation of the NF8 against the PCL-5 (PTSD Checklist for DSM-5) has not been conducted; a general-wellness population offers too little PCL-5 variance for a meaningful estimate. A direct NF8 ↔ PCL-5 validation is planned as a primary outcome of a future clinical/military pilot. We make no PTSD-specific efficacy claim in this document.
Biomarker convergence
We separate this from the questionnaire-based validation above because it speaks to a different and deeper question: whether the NF8 tracks physiology at all, or is purely mood-congruent self-report. The honest headline is that the raw cross-sectional correlations are small, and the more informative finding is in their temporal structure.
Cross-sectional correlations are small — by expectation. Across 100,345 phone-camera PPG observations from 12,845 users, NF8 total score correlated with all three autonomic markers in the literature-consistent direction, but weakly:
| Biomarker | N | r | 95% CI | Direction |
|---|---|---|---|---|
| HRV (RMSSD) | 100,345 | −0.029 | [−0.035, −0.023] | Higher stress → lower HRV |
| Resting heart rate | 100,345 | +0.091 | [+0.085, +0.097] | Higher stress → higher HR |
| Breathing rate | 100,345 | +0.032 | [+0.025, +0.038] | Higher stress → faster breathing |
All p < 0.001 (significance here is driven largely by the very large N; the effect sizes are what matter, and they are small). These should be read as lower-bound estimates, attenuated by three factors: substantial measurement noise in phone-camera PPG relative to clinical-grade equipment; entirely uncontrolled real-world measurement conditions; and temporal decoupling between the PPG reading and the NF8 check-in. Clinical-grade ECG validation in controlled settings is planned to reduce this noise.
The temporal gradient is the real finding. PPG–NF8 coupling strengthens systematically as the reading moves closer to the check-in — and, critically, is strongest for readings taken before the user has entered any self-report:
| Timing window | N | HR r | HRV r | RR r |
|---|---|---|---|---|
| 2–20 min before check-in | 4,361 | +0.148 | −0.064 | +0.048 |
| 0–2 min before check-in | 5,807 | +0.193 | −0.054 | +0.089 |
| All data (mixed timing) | 100,345 | +0.091 | −0.029 | +0.032 |
| 0–2 min after check-in | 71,714 | +0.079 | −0.029 | +0.027 |
| 2–20 min after check-in | 19,274 | +0.086 | −0.036 | +0.028 |
All p < 0.001. Pre-assessment correlations run roughly 2–2.4× stronger than post-assessment across all three markers, a pattern present in every one of the 11 calendar quarters analyzed (HR gradient ratio range 1.27–3.81×). This tightening as decoupling decreases is itself consistent with the attenuation argument above: when the reading is closer in time to the report, more of the true signal survives.
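The pre/post comparison can be rechecked from the table's rounded correlations. The sketch below pairs each pre-check-in window with the post-check-in window at the same distance, which is an assumption about how to pair them. The summary range quoted above may reflect a pooled comparison instead, so the window-level ratios need not match it exactly.

```python
# Correlations copied from the timing-window table (rounded as published).
windows = {
    "2-20 min before": {"HR": 0.148, "HRV": -0.064, "RR": 0.048},
    "0-2 min before": {"HR": 0.193, "HRV": -0.054, "RR": 0.089},
    "0-2 min after": {"HR": 0.079, "HRV": -0.029, "RR": 0.027},
    "2-20 min after": {"HR": 0.086, "HRV": -0.036, "RR": 0.028},
}


def gradient_ratio(pre: dict[str, float], post: dict[str, float]) -> dict[str, float]:
    """Ratio of correlation magnitudes, pre-check-in over post-check-in."""
    return {marker: abs(pre[marker]) / abs(post[marker]) for marker in pre}


near = gradient_ratio(windows["0-2 min before"], windows["0-2 min after"])
far = gradient_ratio(windows["2-20 min before"], windows["2-20 min after"])

for label, ratios in (("0-2 min windows", near), ("2-20 min windows", far)):
    print(label, {marker: round(value, 2) for marker, value in ratios.items()})
```

Under this pairing every ratio exceeds 1, so the direction of the gradient holds for each marker, while its size varies by marker and window.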
What this does and does not establish. Because the pre-assessment reading is taken before the user engages the self-report interface, the user's autonomic state in those minutes predicts their about-to-be-reported NF8. This is evidence of temporal precedence: endogenous physiological state precedes and predicts the self-report, which argues against the self-report being a free-floating expression of expectation (the central concern with any proprietary self-report measure). A within-person analysis points the same way: day-to-day NF8 changes tracked autonomic shifts (heart rate r = +0.150, p < 0.001, N = 673). One alternative explanation, that users were anchoring to biomarker values displayed after each reading, was tested directly: removing the strongest available anchor (the interpreted composite BALANCE score) produced no within-person change in PPG–NF8 coupling, which argues against perceptual anchoring as the driver.
We are deliberately careful about the ceiling of this claim. Temporal precedence is not proof that the NF8 measures the physiology; both could be downstream of a shared underlying state, and a small correlation — even one with informative temporal structure — remains small. What the gradient does support, and support fairly robustly, is the narrower and still-important point that NF8 scores are not merely motivated self-report: a physiological signal measured before the user reports anything systematically predicts what they then report.
Contextual benchmark (rough, pre-RCT)
We include a benchmark to help readers situate an effect size, with an explicit caveat about what kind of number ours is. NEUROFIT's d ≈ 0.80 is a within-person pre-post change — a user's baseline compared to their later score. The methodologically matched comparison is therefore to other within-condition pre-post changes, not to controlled between-group effects.
On that matched basis: a meta-analysis of 333 depression RCTs found within-arm pre-post improvement of g = 0.37 in waitlist conditions and g = 0.64 in care-as-usual conditions (Cuijpers et al., 2024). NEUROFIT's pre-post change is comparable to, or modestly above, the care-as-usual figure — in a substantially shorter window.
For broader context only, and not a like-for-like comparison: the controlled between-group effect of psychotherapy for depression is g = 0.63 unadjusted, falling to g = 0.31 after correction for risk of bias and publication bias (Cuijpers et al., 2019). These are between-group effects, meaning the counterfactual — natural recovery, expectancy, regression to the mean — has already been subtracted out. A within-person pre-post number like ours still contains those components, so it cannot be set directly against a between-group figure, and we make no claim that NEUROFIT "exceeds therapy."
The honest reading: this is a rough placeholder until a randomized trial produces a between-group NEUROFIT effect, and that controlled number will very likely be lower than the uncontrolled 0.80, because a control arm removes precisely the components a pre-post change retains. The constructs (somatic stress vs depression), instruments, populations, and timeframes all differ. We present these figures to orient, not to assert equivalence, and we explicitly do not claim that NEUROFIT is a substitute for psychotherapy.
Limitations
We consider an honest limitations section part of the evidence, not a disclaimer appended to it.
- No randomized control group. This is a single-arm observational analysis. It cannot, on its own, separate the intervention's effect from natural recovery, expectancy, or other concurrent factors. Causal language is therefore avoided, and benchmark comparisons are illustrative. A waitlist-controlled randomized trial is the appropriate next step and is planned.
- Self-reported outcome on a proprietary instrument. The NF8 is a self-report measure developed in-house. It is not a clinician-administered or independently established diagnostic instrument.
- Unmeasured pre-engagement selection. The path from app download to analyzed user is not fully observed; generalizability to less-motivated populations is unknown. (The clinician-gifted cohort addresses the payment selection axis specifically, not all selection.)
- MAR is supported but not proven. The sensitivity analyses are consistent with Missing-At-Random, but MNAR concerns dependence on unobserved values and cannot be definitively excluded. Formal MNAR sensitivity bounds (e.g. delta-adjusted / tipping-point analyses) are a reasonable addition for future versions.
- RTM is quantified for both populations. The primary population's RTM is negligible (d ≈ 0.05); the more-selected ≥ 5/16 subgroup's is larger (d ≈ 0.31), leaving a large residual effect (d ≈ 0.65). Both estimates depend on the test-retest reliability input (ICC); readers recomputing them should use the cohort-appropriate reliability.
- Modest validation samples. Questionnaire convergent-validity Ns (120–407) are small and drawn from the target market, not clinical settings. Biomarker correlations, while based on a very large number of observations, are small in magnitude (see Biomarker convergence) and derived from uncontrolled phone-camera readings; clinical-grade replication is planned.
- Cross-scale comparisons are approximate. NF8-to-clinical-scale correlations (r = 0.52–0.63) represent overlapping but non-identical constructs (27–40% shared variance); the NF8 and these instruments are not interchangeable.
- Independent at analysis, not yet reproduced. The analysis was performed by a statistical advisor who was independent of the company at the time the work was conducted. Reproduction by a separate party with no relationship to the company would further strengthen these findings and is planned.
Regulatory and claims statement
This document presents observational analyses of real-world app usage and reflects self-reported stress among engaged users. It is intended for general wellness information, consistent with FTC Health Products Compliance Guidance and FDA policy for low-risk mobile wellness applications. No disease-specific, diagnostic, or treatment representations are made. Benchmark comparisons are for illustrative purposes only. Figures are subject to ongoing clinical, statistical, and regulatory review.
References
- Barnett, A. G., van der Pols, J. C., & Dobson, A. J. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology, 34(1), 215–220.
- Bovin, M. J., Marx, B. P., Weathers, F. W., et al. (2016). Psychometric properties of the PTSD Checklist for DSM-5 (PCL-5) in veterans. Psychological Assessment, 28(11), 1379–1391.
- Critchley, H. D., & Garfinkel, S. N. (2017). Interoception and emotion. Current Opinion in Psychology, 17, 7–14.
- Cuijpers, P., Karyotaki, E., Reijnders, M., & Ebert, D. D. (2019). Was Eysenck right after all? A reassessment of the effects of psychotherapy for adult depression. Epidemiology and Psychiatric Sciences, 28(1), 21–30.
- Cuijpers, P., Miguel, C., Harrer, M., Ciharova, M., & Karyotaki, E. (2024). The overestimation of the effect sizes of psychotherapies for depression in waitlist controlled trials. Epidemiology and Psychiatric Sciences, 33, e56.
- Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.
- Linden, A. (2013). Assessing regression to the mean effects in health care initiatives. BMC Medical Research Methodology, 13, 119.
- Lovibond, S. H., & Lovibond, P. F. (1995). Manual for the Depression Anxiety Stress Scales (2nd ed.). Psychology Foundation of Australia.
- Porges, S. W. (1995). Orienting in a defensive world: mammalian modifications of our evolutionary heritage. A polyvagal theory. Psychophysiology, 32(4), 301–318.
- Porges, S. W. (2011). The Polyvagal Theory. W. W. Norton & Company.
- Spitzer, R. L., Kroenke, K., Williams, J. B., & Löwe, B. (2006). A brief measure for assessing generalized anxiety disorder: the GAD-7. Archives of Internal Medicine, 166(10), 1092–1097.
Frequently asked questions
Does NEUROFIT reduce stress?
In a real-world observational analysis of 16,487 app users, self-reported stress on the NF8 scale fell 38.7% by Day 7 and 52.1% by Day 35 (Cohen's d = 0.80 and 0.86). However, this is a single-arm observational analysis, not a randomized controlled trial, so it does not establish that NEUROFIT causes these changes; without a control group, natural recovery, expectancy, and other concurrent factors cannot be ruled out.
What's the evidence behind NEUROFIT?
The evidence is a 32-month observational analysis of 16,487 users across 100+ countries, in which self-reported NF8 stress scores declined rapidly and stayed lower, with roughly 60% of checking-in users meeting a prespecified threshold of at least a 30% reduction by Day 6–14. The analysis also tested the leading non-causal explanations — regression to the mean, response-driven dropout, and selection of paying users — and found a large residual effect remained, though none of these tests substitutes for a randomized control group.
Is the NF8 validated?
The NF8 shows good internal consistency (Cronbach's α = 0.80) and correlates moderately with established scales (DASS-21 Stress r = 0.52, GAD-7 r = 0.57, PHQ-9 r = 0.63). Physiological readings taken before a check-in also predict the about-to-be-reported score, although those correlations are small. These convergent-validity samples are modest in size and drawn from the target market rather than clinical settings, and the NF8 is a proprietary self-report measure, not a clinician-administered or independently established diagnostic instrument.
What are the limitations of the NEUROFIT data?
The main limitation is the absence of a randomized control group, so the analysis cannot separate the intervention's effect from natural recovery or expectancy, and causal language is avoided. Other limitations include a self-reported proprietary outcome measure, unmeasured pre-engagement selection, modest validation samples, biomarker correlations that are small in magnitude and derived from uncontrolled phone-camera readings, and the fact that the analysis — though independent at the time it was conducted — has not yet been reproduced by a separate party.
© NEUROFIT (Xama Technologies, Inc.). This whitepaper describes real-world observational data and is intended for public distribution.