Study design: Within-subjects, 18 participants × 3 topics (Mendel, DNA-Replikation [DNA replication], Ökologie [ecology]) × 3 tutoring mediums (Chat, Video, VR), counterbalanced with a Latin square. Each topic was tested at 4 timepoints: Pre-Reading → Post-Reading → Pre-Tutoring → Post-Tutoring. Tests: 15 multiple-choice questions per test, each with a confidence rating on a 1–7 scale. With 15 questions, one question corresponds to 6.7 percentage points (pp), so all scores and gains move in steps of 6.7 pp.

Key Numbers

Metric	Value
Participants	18
Total test entries	216 (18 × 3 topics × 4 timepoints)
Overall start-to-finish gain (Pre-Reading → Post-Tutoring)	+27.9 pp (SD=19.9, t=10.32)
Overall tutoring gain (Pre-Tutoring → Post-Tutoring)	+10.6 pp
Highest tutoring gain by medium	VR: +13.7 pp (d=0.62)
Highest tutoring gain by topic	DNA-Replikation: +16.7 pp

Medium Preference Rankings

Within-subject preference rankings (N=18): each participant ranked all three mediums from 1 (most preferred) to 3 (least preferred)

Medium	Mean Rank	Median	SD	Ranked 1st	Ranked 2nd	Ranked 3rd
Chat	1.61	1.0	0.85	61% (11/18)	17% (3/18)	22% (4/18)
VR	2.06	2.0	0.87	33% (6/18)	28% (5/18)	39% (7/18)
Video	2.33	2.0	0.59	6% (1/18)	56% (10/18)	39% (7/18)

Individual Participant Rankings

Participant	1st Choice	2nd Choice	3rd Choice
P1	VR	Chat	Video
P2	Chat	Video	VR
P3	VR	Chat	Video
P4	VR	Video	Chat
P5	VR	Chat	Video
P6	Chat	Video	VR
P7	Chat	Video	VR
P8	Chat	Video	VR
P9	Chat	VR	Video
P10	Chat	VR	Video
P11	Chat	VR	Video
P12	Chat	VR	Video
P13	Video	VR	Chat
P14	Chat	Video	VR
P15	VR	Video	Chat
P16	Chat	Video	VR
P17	VR	Video	Chat
P18	Chat	Video	VR
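The summary statistics for the rankings can be recomputed directly from the individual rankings. The sketch below copies the 18 rows from the table above and tallies them with the standard library. The names `RANKINGS` and `ranks_for` are illustrative and not taken from the project code.

```python
from statistics import mean, median, stdev

# 1st, 2nd, and 3rd choice per participant, copied from the table above.
RANKINGS = [
    ("VR", "Chat", "Video"),   # P1
    ("Chat", "Video", "VR"),   # P2
    ("VR", "Chat", "Video"),   # P3
    ("VR", "Video", "Chat"),   # P4
    ("VR", "Chat", "Video"),   # P5
    ("Chat", "Video", "VR"),   # P6
    ("Chat", "Video", "VR"),   # P7
    ("Chat", "Video", "VR"),   # P8
    ("Chat", "VR", "Video"),   # P9
    ("Chat", "VR", "Video"),   # P10
    ("Chat", "VR", "Video"),   # P11
    ("Chat", "VR", "Video"),   # P12
    ("Video", "VR", "Chat"),   # P13
    ("Chat", "Video", "VR"),   # P14
    ("VR", "Video", "Chat"),   # P15
    ("Chat", "Video", "VR"),   # P16
    ("VR", "Video", "Chat"),   # P17
    ("Chat", "Video", "VR"),   # P18
]


def ranks_for(medium):
    """Return the rank (1 to 3) that each participant gave to `medium`."""
    return [order.index(medium) + 1 for order in RANKINGS]


for medium in ("Chat", "VR", "Video"):
    r = ranks_for(medium)
    counts = [r.count(k) for k in (1, 2, 3)]  # how often ranked 1st, 2nd, 3rd
    print(f"{medium:5s} mean={mean(r):.2f} median={median(r):.1f} "
          f"sd={stdev(r):.2f} counts(1st, 2nd, 3rd)={counts}")
```

Because every participant ranks all three mediums, the 1st, 2nd, and 3rd counts must each sum to 18 across mediums, which is a quick consistency check for any summary table.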

Participant Scores Overview

Scores by Participant, Topic & Medium

Participant	Topic	Medium	Pre-Reading	Post-Reading	Pre-Tutoring	Post-Tutoring	Tutoring Gain
P1	DNA-Replikation	Video	40.0	80.0	53.3	60.0	+6.7
P1	Mendel	Chat	60.0	73.3	80.0	93.3	+13.3
P1	Ökologie	VR	60.0	100.0	100.0	100.0	0.0
P2	DNA-Replikation	Video	40.0	60.0	40.0	33.3	−6.7
P2	Mendel	Chat	53.3	80.0	60.0	66.7	+6.7
P2	Ökologie	VR	93.3	66.7	86.7	80.0	−6.7
P3	DNA-Replikation	Video	13.3	66.7	53.3	86.7	+33.4
P3	Mendel	Chat	60.0	60.0	53.3	86.7	+33.4
P3	Ökologie	VR	86.7	93.3	13.3	100.0	+86.7
P4	DNA-Replikation	VR	33.3	93.3	93.3	100.0	+6.7
P4	Mendel	Video	60.0	80.0	86.7	93.3	+6.6
P4	Ökologie	Chat	60.0	93.3	93.3	93.3	0.0
P5	DNA-Replikation	VR	53.3	66.7	46.7	53.3	+6.6
P5	Mendel	Video	33.3	53.3	66.7	73.3	+6.6
P5	Ökologie	Chat	60.0	80.0	86.7	80.0	−6.7
P6	DNA-Replikation	VR	53.3	66.7	80.0	100.0	+20.0
P6	Mendel	Video	66.7	100.0	93.3	93.3	0.0
P6	Ökologie	Chat	86.7	100.0	100.0	100.0	0.0
P7	DNA-Replikation	Chat	60.0	33.3	60.0	60.0	0.0
P7	Mendel	VR	60.0	86.7	80.0	80.0	0.0
P7	Ökologie	Video	66.7	73.3	80.0	73.3	−6.7
P8	DNA-Replikation	Chat	40.0	20.0	26.7	86.7	+60.0
P8	Mendel	VR	53.3	86.7	60.0	80.0	+20.0
P8	Ökologie	Video	40.0	86.7	80.0	86.7	+6.7
P9	DNA-Replikation	Chat	20.0	60.0	66.7	86.7	+20.0
P9	Mendel	VR	0.0	53.3	33.3	66.7	+33.4
P9	Ökologie	Video	40.0	60.0	66.7	60.0	−6.7
P10	DNA-Replikation	Video	26.7	46.7	13.3	86.7	+73.4
P10	Mendel	Chat	66.7	80.0	73.3	86.7	+13.4
P10	Ökologie	VR	73.3	93.3	93.3	93.3	0.0
P11	DNA-Replikation	Video	53.3	80.0	86.7	93.3	+6.6
P11	Mendel	Chat	46.7	80.0	80.0	80.0	0.0
P11	Ökologie	VR	73.3	100.0	80.0	100.0	+20.0
P12	DNA-Replikation	Video	20.0	53.3	46.7	33.3	−13.4
P12	Mendel	Chat	53.3	86.7	73.3	93.3	+20.0
P12	Ökologie	VR	60.0	66.7	86.7	80.0	−6.7
P13	DNA-Replikation	VR	66.7	66.7	100.0	100.0	0.0
P13	Mendel	Video	66.7	73.3	86.7	93.3	+6.6
P13	Ökologie	Chat	100.0	86.7	100.0	93.3	−6.7
P14	DNA-Replikation	VR	80.0	93.3	93.3	93.3	0.0
P14	Mendel	Video	66.7	93.3	100.0	100.0	0.0
P14	Ökologie	Chat	66.7	86.7	86.7	80.0	−6.7
P15	DNA-Replikation	VR	33.3	53.3	33.3	66.7	+33.4
P15	Mendel	Video	46.7	60.0	73.3	80.0	+6.7
P15	Ökologie	Chat	60.0	80.0	80.0	80.0	0.0
P16	DNA-Replikation	Chat	46.7	40.0	53.3	80.0	+26.7
P16	Mendel	VR	46.7	66.7	66.7	80.0	+13.3
P16	Ökologie	Video	86.7	80.0	86.7	100.0	+13.3
P17	DNA-Replikation	Chat	40.0	53.3	46.7	66.7	+20.0
P17	Mendel	VR	53.3	80.0	60.0	80.0	+20.0
P17	Ökologie	Video	80.0	66.7	80.0	73.3	−6.7
P18	DNA-Replikation	Chat	26.7	86.7	86.7	93.3	+6.6
P18	Mendel	VR	46.7	93.3	93.3	93.3	0.0
P18	Ökologie	Video	80.0	93.3	93.3	93.3	0.0

Participant Summary

Participant	N Tests	Avg Score %	Avg Confidence	Pre-Read	Post-Read	Pre-Tutor	Post-Tutor	Reading Gain	Tutoring Gain
P1	12	75.0	4.38	53.3	84.4	77.8	84.4	+31.1	+6.7
P2	12	63.3	3.22	62.2	68.9	62.2	60.0	+6.7	−2.2
P3	12	64.4	3.83	53.3	73.3	40.0	91.1	+20.0	+51.2
P4	12	81.6	5.49	51.1	88.9	91.1	95.5	+37.8	+4.4
P5	12	62.8	3.24	48.9	66.7	66.7	68.9	+17.8	+2.2
P6	12	86.7	5.35	68.9	88.9	91.1	97.8	+20.0	+6.7
P7	12	67.8	3.03	62.2	64.4	73.3	71.1	+2.2	−2.2
P8	12	62.2	4.48	44.4	64.5	55.6	84.5	+20.0	+28.9
P9	12	51.1	1.50	20.0	57.8	55.6	71.1	+37.8	+15.6
P10	12	69.4	4.08	55.6	73.3	60.0	88.9	+17.8	+28.9
P11	12	79.4	3.98	57.8	86.7	82.2	91.1	+28.9	+8.9
P12	12	62.8	4.69	44.4	68.9	68.9	68.9	+24.5	0.0
P13	12	86.1	5.47	77.8	75.6	95.6	95.5	−2.2	0.0
P14	12	86.7	3.92	71.1	91.1	93.3	91.1	+20.0	−2.2
P15	12	62.2	4.72	46.7	64.4	62.2	75.6	+17.8	+13.4
P16	12	69.5	4.07	60.0	62.2	68.9	86.7	+2.2	+17.8
P17	12	65.0	3.57	57.8	66.7	62.2	73.3	+8.9	+11.1
P18	12	81.7	3.86	51.1	91.1	91.1	93.3	+40.0	+2.2

Scores by Medium (aggregate)

Medium	Avg Score %	Avg Confidence	Tutoring Gain	Cohen's d
Chat	70.8 (SD=20.5)	4.09 (SD=1.78)	+11.1 pp	d=0.65
Video	68.5 (SD=22.2)	3.99 (SD=1.80)	+7.0 pp	d=0.36
VR	73.6 (SD=22.3)	4.06 (SD=1.64)	+13.7 pp	d=0.62
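The tutoring gains and effect sizes in this table can be reproduced from the per-participant "Tutoring Gain" column. The sketch below assumes that Cohen's d is computed as the mean gain divided by the standard deviation of the gains (the paired-samples variant, often written d_z). That assumption reproduces the reported values but is not confirmed by the analysis scripts. `GAINS` and `cohens_dz` are illustrative names.

```python
from statistics import mean, stdev

# Tutoring gains in pp (Post-Tutoring minus Pre-Tutoring), in participant
# order P1..P18, copied from "Scores by Participant, Topic & Medium".
GAINS = {
    "Chat":  [13.3, 6.7, 33.4, 0.0, -6.7, 0.0, 0.0, 60.0, 20.0,
              13.4, 0.0, 20.0, -6.7, -6.7, 0.0, 26.7, 20.0, 6.6],
    "Video": [6.7, -6.7, 33.4, 6.6, 6.6, 0.0, -6.7, 6.7, -6.7,
              73.4, 6.6, -13.4, 6.6, 0.0, 6.7, 13.3, -6.7, 0.0],
    "VR":    [0.0, -6.7, 86.7, 6.7, 6.6, 20.0, 0.0, 20.0, 33.4,
              0.0, 20.0, -6.7, 0.0, 0.0, 33.4, 13.3, 20.0, 0.0],
}


def cohens_dz(gains):
    """Paired-samples effect size: mean gain over the sample SD of the gains."""
    return mean(gains) / stdev(gains)


for medium, gains in GAINS.items():
    print(f"{medium:5s} gain={mean(gains):+.1f} pp  d={cohens_dz(gains):.2f}")
```

Under this definition VR has the larger mean gain, while Chat has the slightly larger d because its gains vary less.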

Scores by Topic (aggregate)

Topic	Avg Score %	Avg Confidence	Tutoring Gain
Mendel	71.8 (SD=18.4)	4.15 (SD=1.75)	+11.1 pp (SD=10.7)
DNA-Replikation	60.1 (SD=24.3)	3.40 (SD=1.75)	+16.7 pp (SD=22.5)
Ökologie	81.1 (SD=16.4)	4.60 (SD=1.49)	+4.1 pp (SD=22.0)
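The gain SDs in this table allow a rough standardized comparison between topics. The sketch below divides each mean gain by its SD and converts the result into a paired t-statistic. It assumes the SDs in parentheses are the SDs of the 18 individual gains per topic; `TOPIC_GAINS` and `effect` are illustrative names.

```python
# (mean tutoring gain in pp, SD of the gains) per topic, from the table above.
TOPIC_GAINS = {
    "Mendel": (11.1, 10.7),
    "DNA-Replikation": (16.7, 22.5),
    "Ökologie": (4.1, 22.0),
}
N = 18  # assumption: one gain per participant and topic

effect = {}
for topic, (gain, sd) in TOPIC_GAINS.items():
    d = gain / sd        # standardized gain (paired-samples Cohen's d)
    t = d * N ** 0.5     # equivalent paired t-statistic with N - 1 df
    effect[topic] = (d, t)
    print(f"{topic:16s} d={d:.2f}  t({N - 1})={t:.2f}")
```

On these numbers DNA-Replikation has the largest raw gain, but Mendel has the largest standardized effect because its gains are much less variable.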

Confidence by Medium (Tutoring Phase)

Medium	Pre-Tutoring Conf	Post-Tutoring Conf	ΔConfidence
Chat	4.09	5.49	+1.40 (d=1.20)
Video	4.11	5.23	+1.12 (d=1.19)
VR	4.23	5.30	+1.07 (d=1.09)

A. Overall Learning Trajectory

A1 – Overall Trajectory (Score + Confidence)

Scores rise from 54.8% (Pre-Reading) to 74.3% (Post-Reading), dip slightly to 72.1% at Pre-Tutoring (plausibly due to the time gap between sessions), then climb to 82.7% (Post-Tutoring). Confidence tracks this pattern closely, moving from 2.40 → 4.31 → 4.14 → 5.34 on the 1–7 scale. The reading phase accounts for the largest single jump (+19.5 pp), while tutoring adds another +10.6 pp.
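The phase-by-phase arithmetic behind these figures is small enough to check by hand. The sketch below uses only the four mean scores quoted above; `TRAJECTORY`, `steps`, and `total` are illustrative names.

```python
# Mean scores (%) at the four timepoints, as reported above.
TRAJECTORY = [
    ("Pre-Reading", 54.8),
    ("Post-Reading", 74.3),
    ("Pre-Tutoring", 72.1),
    ("Post-Tutoring", 82.7),
]

# Change between consecutive timepoints, rounded to one decimal place
# to hide floating-point noise.
steps = {
    f"{a} -> {b}": round(y - x, 1)
    for (a, x), (b, y) in zip(TRAJECTORY, TRAJECTORY[1:])
}
total = round(TRAJECTORY[-1][1] - TRAJECTORY[0][1], 1)

for label, delta in steps.items():
    print(f"{label:32s} {delta:+.1f} pp")
print(f"{'Pre-Reading -> Post-Tutoring':32s} {total:+.1f} pp")
```

The three steps (reading gain, between-session dip, tutoring gain) add up to the overall start-to-finish gain.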

A2 – Trajectory by Medium

All three mediums show the same general upward trajectory. VR reaches the highest Post-Tutoring score (85.9%), followed by Chat (83.7%) and Video (78.5%). The Pre-Tutoring baselines are comparable (~72% for all three), so the medium differences emerge specifically during the tutoring phase.

A3 – Trajectory by Topic

Ökologie starts highest (Pre-Reading 69.3%) and stays highest throughout, suggesting greater prior knowledge. DNA-Replikation starts lowest (42.6%) and shows the steepest climb, gaining +16.7 pp from tutoring alone. Mendel is intermediate. The topic-level differences highlight that DNA-Replikation has the most room for improvement, while Ökologie may suffer from ceiling effects.

A4 – Participant-Level Heatmaps

Heatmaps of individual scores and confidence across all 4 timepoints. Notable patterns: P3 has a dramatic 13.3% → 100.0% swing for Ökologie during tutoring; P8 shows a large Chat tutoring gain (26.7% → 86.7% in DNA-Replikation); P9 reports consistently low confidence (avg 1.50) despite moderate scores.

B. Tutoring Phase Deep-Dive

B1 – Paired Slopes by Medium (with Statistics)

Individual Pre → Post-Tutoring score changes per participant, grouped by medium. Each line is one participant-topic pair. Chat and VR show more upward slopes; Video has the most mixed pattern. Paired t-test results and Cohen's d are annotated on each panel.

B2 – Paired Slopes by Topic

Same paired-slope view, now grouped by topic. DNA-Replikation shows the most dramatic improvements (many steep upward lines from low baselines), while Ökologie has flatter slopes due to already-high Pre-Tutoring scores.

B3 – Tutoring Gains by Medium (Effect Sizes)

Bar charts with individual data points showing score gains (left) and confidence gains (right) by medium. VR leads with +13.7 pp (d=0.62), Chat follows at +11.1 pp (d=0.65), and Video lags at +7.0 pp (d=0.36). Confidence gains are large across all mediums (d > 1.0), with Chat showing the highest confidence boost (+1.40, d=1.20).

B4 – Medium × Topic Interaction

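Tutoring gains broken down by both medium and topic. The interaction reveals that gains vary considerably across topic–medium combinations. Recomputed from the per-participant table, DNA-Replikation benefits most under Chat (+22.2 pp) and Video (+16.7 pp) but not under VR (+11.1 pp). Ökologie gains are near zero under Chat and Video (ceiling effect); the VR × Ökologie cell mean (+15.6 pp) is inflated by P3's +86.7 pp.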

B5 – Tutoring Effectiveness Dashboard

Six-panel dashboard combining: Pre vs Post scores, gain distributions, medium comparison, score vs gain relationship, confidence change, and individual trajectories. Provides a comprehensive at-a-glance view of tutoring effectiveness.

C. Start-to-Finish Gains

C1 – Pre-Reading to Post-Tutoring (Paired)

Each line connects a participant's Pre-Reading score to their Post-Tutoring score for each topic. The overall gain of +27.9 pp (t=10.32, p<.001) represents the full learning effect of reading + tutoring combined. Nearly all lines slope upward, demonstrating consistent learning across participants.
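The reported t-statistic can be recovered from the mean gain and its SD. The sketch below assumes a paired (one-sample) t-test over 54 participant-topic gains, which is an inference from the reported numbers rather than something the report states; `n_pairs` and the other names are illustrative.

```python
import math

mean_gain = 27.9   # pp, Pre-Reading -> Post-Tutoring
sd_gain = 19.9     # SD of the individual gains
n_pairs = 18 * 3   # assumption: one gain per participant-topic pair

standard_error = sd_gain / math.sqrt(n_pairs)
t_stat = mean_gain / standard_error
print(f"SE = {standard_error:.2f} pp, t({n_pairs - 1}) = {t_stat:.2f}")
```

With these rounded inputs the result lands close to the reported t=10.32. Using only 18 per-participant means would give a t near 6, so the 54-pair reading appears to be the one used. Note that treating the 54 pairs as independent ignores that each participant contributes three of them.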

C2 – Learning Gains Overview

Side-by-side comparison of reading gains vs tutoring gains across mediums and topics. The reading phase contributes more absolute score improvement on average (+19.5 pp) than the tutoring phase (+10.6 pp), but tutoring builds on already-higher baselines and adds further consolidation.

D. Confidence Analysis

D1 – Confidence vs Test Score (Scatter)

Test scores and average confidence ratings are strongly positively correlated: participants who score higher also report higher confidence. This holds across all timepoints, though the relationship is tightest at Post-Tutoring, when both scores and confidence are highest.

D2 – Change in Confidence vs Change in Score

During tutoring, score gains and confidence gains are positively correlated: participants who improved their scores also became more confident. However, some participants show large confidence increases even with modest score gains, suggesting that tutoring can raise perceived mastery beyond what the measured knowledge gains alone would explain.

D3 – Confidence Calibration

Calibration analysis comparing actual performance to self-reported confidence. Participants tend to be slightly under-confident at Pre-Reading (low confidence, moderate scores) and approach better calibration by Post-Tutoring. This indicates that the full learning journey improves not just knowledge but also self-assessment accuracy.

E. Personality Correlations

E1 – Big Five vs Tutoring Outcomes (Heatmap)

Pearson correlation heatmap between Big Five personality traits and tutoring outcomes (score gain, confidence gain, post-tutoring score, post-tutoring confidence). Notable finding: Agreeableness shows a significant positive correlation with confidence gain (r≈0.60, p<.05), suggesting that more agreeable participants showed larger confidence boosts from tutoring.

E2 – Trait vs Tutoring Score Gain

Scatter plots of each Big Five trait against tutoring score gain, with regression lines and correlation coefficients. Most personality traits show weak relationships with score gains, suggesting that tutoring effectiveness is relatively independent of personality in this sample (N=18). The strongest trend is Agreeableness → confidence gain rather than score gain.

G. Effect Analysis

Generated by generate_plots_effects.py → Data/plots_effects/

Statistical exports → Data/stats/effects_*.csv, Data/stats/outlier_influence.csv

G-F. Effect Without Ökologie (vs. With)

Ökologie has markedly higher pre-tutoring baselines (ceiling effects), which compresses gains for that topic. This section quantifies how much those ceiling effects suppress the observed effect sizes, and presents a full side-by-side comparison of all mediums with and without Ökologie included.

GF1 – Cohen's d Comparison by Medium

Grouped bar chart of Cohen's d per medium under two conditions: All Topics and Excl. Ökologie. Each bar is annotated with the raw mean gain and significance stars. Reference lines mark the conventional small (0.2), medium (0.5), and large (0.8) effect size thresholds. Removing Ökologie raises effect sizes for all mediums. Recomputed from the per-participant gains (d as mean gain over SD of gains), the shift is largest for Chat (d ≈ 0.65 → 1.10), closely followed by VR (≈ 0.62 → 1.02), and smallest for Video (≈ 0.36 → 0.46).

GF2 – Mean Score Gain Comparison

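95% CI bar chart of the raw mean tutoring score gain per medium, both conditions overlaid. Shows the absolute gain shift when Ökologie is excluded. Recomputed from the per-participant table, Chat benefits most from exclusion (+11.1 → +18.3 pp) and Video moderately (+7.0 → +10.5 pp). VR's mean gain changes least and drops slightly (+13.7 → +12.8 pp), because the P3/Ökologie outlier (+86.7 pp) lifts VR's Ökologie cell above its other topics.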

GF3 – Paired Slopes: All Topics vs. Excl. Ökologie

A 2×3 grid (rows: All Topics / Excl. Ökologie; columns: Chat / Video / VR). Each panel shows individual Pre→Post-Tutoring lines colored by topic, the medium mean trajectory (thick diamond marker), and annotated t-test / Cohen's d / p-value. The bottom row directly reveals the cleaner separation in trajectories once the near-zero Ökologie gains are removed.

GF4 – Gain Distribution Comparison

Side-by-side violin + box plots per medium, two per medium (All Topics / Excl. Ökologie). Shows the shift in median, spread, and the location of extreme values. For VR in particular, removing Ökologie tightens the distribution and raises the median, confirming Ökologie's pull toward zero.

GF5 – Descriptive Statistics Table

Rendered table summarizing N, mean gain, SD, Cohen's d, t-statistic, and p-value for all 6 conditions (3 mediums × 2 topic sets) in one view.
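The six conditions in this table can be recomputed from the per-participant gains. The sketch below groups the "Tutoring Gain" column by medium × topic cell and summarizes each medium with and without Ökologie. It assumes d is the mean gain over the SD of the gains and t is the matching paired t-statistic; `CELLS` and `summarize` are illustrative names, not functions from generate_plots_effects.py.

```python
from statistics import mean, stdev

# Tutoring gains (pp) per medium x topic cell, six participants each,
# copied from "Scores by Participant, Topic & Medium".
CELLS = {
    ("Chat", "Mendel"): [13.3, 6.7, 33.4, 13.4, 0.0, 20.0],
    ("Chat", "DNA-Replikation"): [0.0, 60.0, 20.0, 26.7, 20.0, 6.6],
    ("Chat", "Ökologie"): [0.0, -6.7, 0.0, -6.7, -6.7, 0.0],
    ("Video", "Mendel"): [6.6, 6.6, 0.0, 6.6, 0.0, 6.7],
    ("Video", "DNA-Replikation"): [6.7, -6.7, 33.4, 73.4, 6.6, -13.4],
    ("Video", "Ökologie"): [-6.7, 6.7, -6.7, 13.3, -6.7, 0.0],
    ("VR", "Mendel"): [0.0, 20.0, 33.4, 13.3, 20.0, 0.0],
    ("VR", "DNA-Replikation"): [6.7, 6.6, 20.0, 0.0, 0.0, 33.4],
    ("VR", "Ökologie"): [0.0, -6.7, 86.7, 0.0, 20.0, -6.7],
}


def summarize(medium, exclude=()):
    """N, mean gain, SD, d, and paired t for one medium, skipping `exclude` topics."""
    gains = [g for (m, topic), cell in CELLS.items()
             if m == medium and topic not in exclude for g in cell]
    n, m_gain, sd = len(gains), mean(gains), stdev(gains)
    d = m_gain / sd
    return {"n": n, "mean": m_gain, "sd": sd, "d": d, "t": d * n ** 0.5}


for medium in ("Chat", "Video", "VR"):
    for label, exclude in (("All Topics", ()), ("Excl. Ökologie", ("Ökologie",))):
        s = summarize(medium, exclude)
        print(f"{medium:5s} {label:15s} n={s['n']:2d} mean={s['mean']:+5.1f} "
              f"sd={s['sd']:5.1f} d={s['d']:.2f} t={s['t']:.2f}")
```

The same dictionary also gives the nine cell means directly, for example `mean(CELLS[("Chat", "DNA-Replikation")])`.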

G-G. Effect Per Topic

Full effect-size breakdown for each of the three topics independently, across all mediums combined.

GG1 – Effect Per Topic (Gain + Cohen's d)

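Left panel: mean tutoring score gain with 95% CI error bars per topic, annotated with N and significance. Right panel: Cohen's d per topic with threshold reference lines. DNA-Replikation yields the largest mean gain (+16.7 pp, from a high starting deficit), Mendel is intermediate (+11.1 pp), and Ökologie is smallest (+4.1 pp, ceiling effects). In standardized terms the order of the first two flips: dividing each gain by its SD from the topic table gives d ≈ 1.04 for Mendel versus d ≈ 0.74 for DNA-Replikation, because Mendel gains are far less variable.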

GG2 – Paired Slopes per Topic

Three-panel slope plot (one per topic), with lines colored by medium. Medium mean trajectories are drawn as thick diamond markers and labeled with per-medium gains. The overall t-test / d / p annotation summarizes the within-topic effect. Ökologie clearly shows compressed trajectories compared to DNA-Replikation.

G-H. All Medium × Topic Combinations

GH1 – 3×3 Slope Grid (Medium × Topic)

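A 3×3 grid with rows = mediums (Chat, Video, VR) and columns = topics (Mendel, DNA-Replikation, Ökologie). Each of the 9 cells shows individual participant Pre→Post-Tutoring slope lines (colored by topic), the medium mean (thick line), and the annotated effect size (d, p, n). This is the most granular view. Recomputed from the per-participant table, Chat × DNA-Replikation shows the largest cell mean (+22.2 pp), while Chat × Ökologie (≈ −3.4 pp) and Video × Ökologie (≈ 0 pp) are compressed to near zero. VR × Ökologie averages +15.6 pp, but that mean is driven by P3's +86.7 pp; the other five gains in the cell lie between −6.7 and +20.0 pp.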

G-I. Outlier Influence Analysis

Outliers are defined using the 1.5×IQR rule applied per medium on tutoring Score_Gain.
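As an illustration of this rule, the sketch below applies the 1.5×IQR fences to the VR tutoring gains from the score table. The quartile interpolation used by the analysis script is not documented here, so `method="inclusive"` is an assumption; for these data the common interpolation methods give the same quartiles. `VR_GAINS` and `iqr_outliers` are illustrative names.

```python
from statistics import quantiles

# VR tutoring gains (pp) as (participant, topic, gain), from the score table.
VR_GAINS = [
    ("P1", "Ökologie", 0.0), ("P2", "Ökologie", -6.7), ("P3", "Ökologie", 86.7),
    ("P4", "DNA-Replikation", 6.7), ("P5", "DNA-Replikation", 6.6),
    ("P6", "DNA-Replikation", 20.0), ("P7", "Mendel", 0.0),
    ("P8", "Mendel", 20.0), ("P9", "Mendel", 33.4),
    ("P10", "Ökologie", 0.0), ("P11", "Ökologie", 20.0), ("P12", "Ökologie", -6.7),
    ("P13", "DNA-Replikation", 0.0), ("P14", "DNA-Replikation", 0.0),
    ("P15", "DNA-Replikation", 33.4), ("P16", "Mendel", 13.3),
    ("P17", "Mendel", 20.0), ("P18", "Mendel", 0.0),
]


def iqr_outliers(rows, k=1.5):
    """Return ((lower fence, upper fence), rows whose gain lies outside the fences)."""
    values = [gain for _, _, gain in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [row for row in rows if not low <= row[2] <= high]
    return (low, high), flagged


fences, flagged = iqr_outliers(VR_GAINS)
print("fences:", fences)
print("outliers:", flagged)
```

For VR this flags a single point, P3/Ökologie, which lies well above the upper fence.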

GI1 – Score Gain Scatter with Outlier Flags

Jittered scatter of individual score gains per medium. The IQR fences (Q1 − 1.5×IQR and Q3 + 1.5×IQR) are marked in red. Outlier points are highlighted in red and labeled with participant ID and topic name. P3/Ökologie (VR, +86.7 pp) is the most extreme single data point.

GI2 – Outlier Influence on Effect Sizes

Left: grouped bar chart of Cohen's d with All Data vs. Outliers Removed, annotated with raw gains and significance. Right: Δd bar chart showing the change in effect size after outlier removal per medium, where Δd = d (outliers removed) − d (all data). A positive Δd means the outlier(s) were suppressing the standardized effect, typically by inflating the SD more than the mean; a negative Δd means they were inflating it.
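A worked case for VR shows how the sign of the change in d can come out positive even when the removed point is a large gain. The sketch below assumes d is the mean gain over the SD of the gains and that the only VR outlier is the single value above an upper fence of 50 pp (Q1 = 0, Q3 = 20 for these data); the variable names are illustrative.

```python
from statistics import mean, stdev

# VR tutoring gains (pp) in participant order P1..P18, from the score table.
VR = [0.0, -6.7, 86.7, 6.7, 6.6, 20.0, 0.0, 20.0, 33.4,
      0.0, 20.0, -6.7, 0.0, 0.0, 33.4, 13.3, 20.0, 0.0]

UPPER_FENCE = 50.0  # assumption: Q3 + 1.5 * IQR with Q1 = 0 and Q3 = 20


def cohens_dz(gains):
    """Mean gain divided by the sample SD of the gains."""
    return mean(gains) / stdev(gains)


trimmed = [g for g in VR if g <= UPPER_FENCE]   # drops only P3's +86.7
d_all = cohens_dz(VR)
d_removed = cohens_dz(trimmed)
delta_d = d_removed - d_all

print(f"all data:         mean={mean(VR):+.1f}  d={d_all:.2f}")
print(f"outliers removed: mean={mean(trimmed):+.1f}  d={d_removed:.2f}")
print(f"delta d = {delta_d:+.2f}")
```

Removing the point lowers the mean gain but shrinks the SD even more, so the standardized effect rises.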

GI3 – Outlier Heatmap (Participant × Topic per Medium)

Heatmap of tutoring score gain for each participant × topic cell, one panel per medium. Color encodes gain magnitude (red–yellow–green). Cells with a red border are IQR outliers within that medium's distribution. Allows immediate identification of which participant-topic combinations drive extreme results.

F. Questionnaire Analysis

Questionnaires were administered at multiple phases: Pre-Reading, Post-Reading, Pre-Tutoring, and Post-Tutoring. Instruments include: IMI (Intrinsic Motivation Inventory, 26 items), SUS (System Usability Scale, 10 items), UEQ-S (User Experience Questionnaire – Short, 8 items), NASA-TLX (6 workload items), Godspeed (24 tutor impression items), Social Presence Legacy (5 items, VR-only), Cybersickness (5 binary items), IOS (Inclusion of Other in Self), plus stress/readiness/relaxation items and BFI-15 personality traits.

Questionnaire Summary Statistics

SUS Scores by Medium (Tutoring Only, 0–100 scale)

Medium	M	SD	Median	Interpretation
Chat	81.2	18.0	83.8	Good
Video	76.8	15.6	80.0	Above Average
VR	75.4	20.2	78.8	Above Average

IMI Subscales by Medium (Tutoring, 1–7 scale)

Subscale	Chat M (SD)	Video M (SD)	VR M (SD)
Interest/Enjoyment	4.48 (1.39)	3.86 (1.60)	4.24 (1.37)
Value/Usefulness	4.90 (1.68)	4.48 (1.49)	4.70 (1.43)
Perceived Choice	5.64 (1.20)	5.31 (1.31)	5.60 (1.14)

UEQ-S Overall by Medium (Tutoring, −3 to +3 scale)

Medium	M	SD	Interpretation
Chat	1.14	1.09	Good (>0.8)
Video	0.71	1.13	Neutral
VR	0.92	1.05	Good (>0.8)

NASA-TLX Overall Workload by Medium (Tutoring, 1–7 scale)

Medium	M	SD
Chat	3.36	0.82
Video	3.48	0.77
VR	3.40	0.99

Godspeed Tutor Impression by Medium (1–5 scale)

Medium	M	SD
Chat	3.23	0.54
Video	3.08	0.70
VR	3.15	0.50

Social Presence by Medium

Medium	M	SD
Chat	2.10	0.14
Video	—	—
VR	3.01	0.89

IOS (Closeness to Tutor) by Medium (1–7 scale)

Medium	M	SD
Chat	1.88	0.81
Video	1.89	1.37
VR	2.00	1.50

F1 – IMI Subscales: Reading vs Tutoring by Medium

All three IMI subscales are higher during the tutoring phase than the reading phase. Chat consistently scores the highest across Interest/Enjoyment (M=4.48) and Value/Usefulness (M=4.90), followed closely by VR. Video scores lowest on Interest/Enjoyment (M=3.86). Perceived Choice is high across all mediums (>5.3), indicating participants felt autonomy regardless of the tutoring format.

F2 – System Usability Scale (SUS) by Medium

Chat achieves the highest usability score (M=81.2), crossing the "Good" threshold (>80). Video (M=76.8) and VR (M=75.4) are both above average (>68) but below the "Good" cutoff. The higher Chat SUS score likely reflects the familiarity and simplicity of text-based interaction compared to video or VR interfaces.

F3 – UEQ-S: Pragmatic & Hedonic Quality

UEQ-S scores are centered (−3 to +3), with >0.8 indicating "good" quality. Chat leads on both pragmatic (functional) and hedonic (enjoyment) quality during tutoring, while the reading phase shows similar scores across all mediums. All tutoring mediums achieve positive UEQ-S scores, confirming a generally positive user experience.

F4 – NASA-TLX Workload by Medium

Workload subscale comparison across mediums during the tutoring phase. All three mediums have similar overall workload (~3.4/7). Notable differences: Video has the highest mental demand, VR has slightly higher physical demand (expected given headset use), and Chat has the highest temporal demand. Performance ratings (reversed: high = high workload) are comparable across mediums.

F5 – NASA-TLX: Reading vs Tutoring Comparison

Left panel: Overall workload is slightly higher during reading than tutoring across all mediums, suggesting the tutoring phase felt less demanding than independent reading. Right panel: Subscale profiles by medium during tutoring show that Video has a distinctly higher mental demand peak, while VR's profile is slightly elevated on physical demand.

F6 – Godspeed Tutor Impression by Medium

Tutor impression (Godspeed) subscales are moderate across all mediums (around 3/5). Chat scores highest on Perceived Intelligence (showing participants found the chat tutor most "smart"), while VR leads slightly on Animacy. Likeability and Anthropomorphism are fairly similar across mediums. Perceived Safety is high across all conditions.

F7 – Social Presence by Medium

Social Presence was primarily measured for participants who wore the Meta Quest Pro (VR condition). VR produces substantially higher social presence (M=3.01) than Chat (M=2.10); no data are available for Video (not applicable). This is consistent with VR creating a stronger sense of co-presence with the virtual tutor.

F8 – Cybersickness Symptoms by Medium

Cybersickness items are binary (Yes/No). The most commonly reported symptoms across all mediums are difficulty concentrating and eye strain. VR shows slightly elevated rates on most symptoms compared to Chat and Video, which is expected given the headset-based nature of VR interaction.

F9 – Pre-Session States: Stress, Readiness, Relaxation

Pre-session self-reports show that stress levels are low and comparable across all conditions and phases (Pre-Reading vs Pre-Tutoring). Readiness and relaxation are moderate-to-high. There are no significant differences between mediums in pre-session state, which is consistent with the counterbalanced design having controlled for mood/state confounds.

F10 – Additional Measures: IOS, Self-Use, Helpfulness

IOS scores are low across all mediums (~2/7), indicating participants did not feel particularly close to the tutoring agent. Self-reported willingness to use the tutoring method independently and perceived helpfulness are moderate, with Chat tending to score slightly higher on helpfulness.

F11 – Questionnaire Subscale Correlations & Learning Gain

Key correlations (per-participant averages, Pearson r with significance):

IMI Interest ↔ UEQ-S: r=0.83** — strong link between enjoyment and user experience
IMI Interest ↔ Godspeed: r=0.81** — participants who found the tutor more capable also enjoyed the session more
IMI Value ↔ Godspeed: r=0.75** — perceived usefulness correlates with positive tutor impression
SUS ↔ IMI Choice: r=0.69** — higher usability relates to greater perceived autonomy
NASA-TLX ↔ IMI Interest: r=−0.48* — higher workload is associated with lower enjoyment
SUS ↔ Score Gain: r=0.40 — moderate positive (non-significant) link between usability and learning gains
Social Presence excluded from this analysis (VR-only, insufficient cross-medium data)
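The significance markers in this list can be sanity-checked by converting each r into a t-statistic. The sketch below assumes N=18 (one averaged value per participant) and the common convention that * means p<.05 and ** means p<.01, two-tailed; neither the convention nor the exact test is stated in the report. The critical values are the standard t-table entries for 16 degrees of freedom.

```python
import math

N = 18  # assumption: one averaged value per participant
# Two-tailed critical t values for df = N - 2 = 16 (standard t-table).
T_CRIT = {0.05: 2.120, 0.01: 2.921}

CORRELATIONS = {
    "IMI Interest ↔ UEQ-S": 0.83,
    "IMI Interest ↔ Godspeed": 0.81,
    "IMI Value ↔ Godspeed": 0.75,
    "SUS ↔ IMI Choice": 0.69,
    "NASA-TLX ↔ IMI Interest": -0.48,
    "SUS ↔ Score Gain": 0.40,
}


def t_from_r(r, n=N):
    """t-statistic for testing a Pearson r against zero, with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def stars(r, n=N):
    """Significance marker under the assumed * / ** convention."""
    t = abs(t_from_r(r, n))
    if t > T_CRIT[0.01]:
        return "**"
    if t > T_CRIT[0.05]:
        return "*"
    return ""


for pair, r in CORRELATIONS.items():
    print(f"{pair:26s} r={r:+.2f}  t({N - 2})={t_from_r(r):+.2f}  {stars(r)}")
```

Under these assumptions the markers come out as listed above, including the non-significant SUS ↔ Score Gain link.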

F12 – Reading vs Tutoring Phase Comparison Dashboard

Four-panel dashboard comparing reading and tutoring phases across IMI subscales, NASA-TLX overall workload, UEQ-S overall quality, and pre-session stress. IMI subscales increase from reading to tutoring (participants found tutoring more engaging). Workload decreases slightly from reading to tutoring. UEQ-S shows divergence between mediums during tutoring (Chat highest, Video lowest). Pre-session stress remains stable.

F13 – VR-Specific Comparisons

VR-specific panel comparing social presence, cybersickness, and Godspeed across mediums. VR achieves the highest social presence (M=3.01 vs Chat M=2.10), moderate cybersickness symptoms, and Godspeed impressions comparable to the other mediums. The elevated social presence in VR without a corresponding increase in Godspeed tutor impression suggests that VR enhances the sense of "being there" without necessarily changing how the tutor is perceived.

F14 – Medium Preference Rankings (Friedman Test)

Within-subject ranking analysis: each of the 18 participants ranked the three tutoring mediums from 1 (most preferred) to 3 (least preferred). Left panel shows mean rank per condition with individual data points (jitter) and SEM error bars. Right panel shows the percentage of participants assigning each rank position to each condition. Chat received the lowest mean rank (M=1.61) and was most often preferred first (11 of 18 participants, 61%, in the individual rankings). Video received the highest mean rank (M=2.33): only one participant ranked it first, and it was most often placed second. A Friedman test showed no statistically significant difference across mediums (χ²(2) = 4.78, p = .092, Kendall's W = 0.13), which may reflect limited power with N=18. Post-hoc Wilcoxon tests (Bonferroni-corrected α = .017) showed a trend for Chat > Video (p = .027 raw, p = .080 adjusted) that did not survive correction.
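The Friedman statistic follows from the rank sums alone, so it can be checked without the raw data. The sketch below recovers the rank sums from the reported mean ranks and applies the textbook formula without tie correction, which is appropriate because each participant gave a complete ranking. For two degrees of freedom the chi-square tail probability has the closed form exp(−χ²/2). The variable names are illustrative.

```python
import math

n, k = 18, 3  # participants, mediums
mean_rank = {"Chat": 1.61, "VR": 2.06, "Video": 2.33}

# Rank sums are integers, so rounding recovers them from the rounded means.
rank_sum = {m: round(r * n) for m, r in mean_rank.items()}

chi2 = (12 / (n * k * (k + 1))) * sum(R * R for R in rank_sum.values()) \
    - 3 * n * (k + 1)
kendall_w = chi2 / (n * (k - 1))
p_value = math.exp(-chi2 / 2)      # chi-square survival function for df = 2

alpha_bonferroni = 0.05 / 3        # three pairwise post-hoc comparisons
p_adjusted = min(1.0, 0.027 * 3)   # Chat vs Video, from the rounded raw p

print(f"rank sums: {rank_sum}")
print(f"chi2({k - 1}) = {chi2:.2f}, p = {p_value:.3f}, W = {kendall_w:.2f}")
print(f"Bonferroni alpha = {alpha_bonferroni:.3f}, adjusted p = {p_adjusted:.3f}")
```

The adjusted p computed here uses the rounded raw value, so it can differ from the reported .080 in the last digit.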

Summary

Overall: Participants improved by +27.9 pp from Pre-Reading to Post-Tutoring (54.8% → 82.7%).
Tutoring phase: All three mediums produced positive learning gains. VR (+13.7 pp, d=0.62) and Chat (+11.1 pp, d=0.65) outperformed Video (+7.0 pp, d=0.36).
Confidence: Tracked test scores closely. All mediums increased confidence during tutoring, with Chat producing the largest boost (+1.40 on a 7-point scale).
Topics: DNA-Replikation showed the largest tutoring gains (+16.7 pp) from a low baseline, while Ökologie showed the smallest gains (+4.1 pp) likely due to ceiling effects.
Personality: Agreeableness was the only Big Five trait significantly associated with tutoring outcomes (confidence gain, r≈0.60, p<.05).
Usability (SUS): Chat rated highest (M=81.2, "Good"), Video (M=76.8) and VR (M=75.4) above average.
Motivation (IMI): Tutoring phase rated higher than reading phase on all subscales. Chat scored highest on Interest/Enjoyment (M=4.48) and Value/Usefulness (M=4.90).
User Experience (UEQ-S): Chat achieved "Good" quality (M=1.14), VR borderline good (M=0.92), Video neutral (M=0.71).
Workload (NASA-TLX): Similar across all mediums (~3.4/7). Tutoring felt slightly less demanding than reading.
Tutor Impression (Godspeed): Moderate across all mediums (~3.1/5), with Chat slightly ahead on perceived intelligence.
Social Presence: VR (M=3.01) substantially higher than Chat (M=2.10), confirming VR's advantage for co-presence.
Correlations: IMI Interest strongly correlates with UEQ-S (r=0.83) and Godspeed (r=0.81). Higher workload negatively correlates with enjoyment (r=−0.48). SUS shows a moderate positive link with learning gains (r=0.40).

VirTu-Eval: Test Scores, Confidence & Questionnaire Analysis Report