renamed report to readme

tom.hempel
2026-02-22 18:07:33 +01:00
parent 7b2392631d
commit deba6b4256

# VirTu-Eval: Test Scores, Confidence & Questionnaire Analysis Report
> **Study design**: Within-subjects, 18 participants × 3 topics (Mendel, DNA-Replikation, Ökologie) × 3 tutoring mediums (Chat, Video, VR), counterbalanced Latin-square.
> Each topic tested at 4 timepoints: Pre-Reading → Post-Reading → Pre-Tutoring → Post-Tutoring.
> Tests: 15 multiple-choice questions per test, with confidence ratings (1–7 scale) per question.
---
## Key Numbers
| Metric | Value |
|--------|-------|
| Participants | 18 |
| Total test entries | 216 (18 × 3 topics × 4 timepoints) |
| Overall start-to-finish gain (Pre-Reading → Post-Tutoring) | **+27.9 pp** (SD=19.9, t=10.32) |
| Overall tutoring gain (Pre-Tutoring → Post-Tutoring) | **+10.6 pp** |
| Highest tutoring gain by medium | VR: **+13.7 pp** (d=0.62) |
| Highest tutoring gain by topic | DNA-Replikation: **+16.7 pp** |
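
The reported t-statistic can be sanity-checked from the summary values alone. The sketch below assumes a paired t-test over all 54 participant-topic pairs (18 × 3); the report does not state n explicitly, so that value and the helper name are assumptions. Under them the result is about 10.30, close to the reported 10.32, with the small difference consistent with rounding of the mean and SD.

```python
import math

def paired_t_from_summary(mean_gain: float, sd_gain: float, n: int) -> float:
    """t statistic of a paired t-test, given mean and SD of the paired differences."""
    standard_error = sd_gain / math.sqrt(n)
    return mean_gain / standard_error

# Mean and SD come from the Key Numbers table.
# n = 54 is an assumption (18 participants x 3 topics).
t_value = paired_t_from_summary(mean_gain=27.9, sd_gain=19.9, n=54)
print(f"t = {t_value:.2f}")
```
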
---
## Participant Scores Overview
### Scores by Participant, Topic & Medium
| Participant | Topic | Medium | Pre-Reading | Post-Reading | Pre-Tutoring | Post-Tutoring | Tutoring Gain |
|:-----------:|:------|:------:|:-----------:|:------------:|:------------:|:-------------:|:-------------:|
| P1 | DNA-Replikation | Video | 40.0 | 80.0 | 53.3 | 60.0 | +6.7 |
| P1 | Mendel | Chat | 60.0 | 73.3 | 80.0 | 93.3 | +13.3 |
| P1 | Ökologie | VR | 60.0 | 100.0 | 100.0 | 100.0 | 0.0 |
| P2 | DNA-Replikation | Video | 40.0 | 60.0 | 40.0 | 33.3 | -6.7 |
| P2 | Mendel | Chat | 53.3 | 80.0 | 60.0 | 66.7 | +6.7 |
| P2 | Ökologie | VR | 93.3 | 66.7 | 86.7 | 80.0 | -6.7 |
| P3 | DNA-Replikation | Video | 13.3 | 66.7 | 53.3 | 86.7 | +33.4 |
| P3 | Mendel | Chat | 60.0 | 60.0 | 53.3 | 86.7 | +33.4 |
| P3 | Ökologie | VR | 86.7 | 93.3 | 13.3 | 100.0 | +86.7 |
| P4 | DNA-Replikation | VR | 33.3 | 93.3 | 93.3 | 100.0 | +6.7 |
| P4 | Mendel | Video | 60.0 | 80.0 | 86.7 | 93.3 | +6.6 |
| P4 | Ökologie | Chat | 60.0 | 93.3 | 93.3 | 93.3 | 0.0 |
| P5 | DNA-Replikation | VR | 53.3 | 66.7 | 46.7 | 53.3 | +6.6 |
| P5 | Mendel | Video | 33.3 | 53.3 | 66.7 | 73.3 | +6.6 |
| P5 | Ökologie | Chat | 60.0 | 80.0 | 86.7 | 80.0 | -6.7 |
| P6 | DNA-Replikation | VR | 53.3 | 66.7 | 80.0 | 100.0 | +20.0 |
| P6 | Mendel | Video | 66.7 | 100.0 | 93.3 | 93.3 | 0.0 |
| P6 | Ökologie | Chat | 86.7 | 100.0 | 100.0 | 100.0 | 0.0 |
| P7 | DNA-Replikation | Chat | 60.0 | 33.3 | 60.0 | 60.0 | 0.0 |
| P7 | Mendel | VR | 60.0 | 86.7 | 80.0 | 80.0 | 0.0 |
| P7 | Ökologie | Video | 66.7 | 73.3 | 80.0 | 73.3 | -6.7 |
| P8 | DNA-Replikation | Chat | 40.0 | 20.0 | 26.7 | 86.7 | +60.0 |
| P8 | Mendel | VR | 53.3 | 86.7 | 60.0 | 80.0 | +20.0 |
| P8 | Ökologie | Video | 40.0 | 86.7 | 80.0 | 86.7 | +6.7 |
| P9 | DNA-Replikation | Chat | 20.0 | 60.0 | 66.7 | 86.7 | +20.0 |
| P9 | Mendel | VR | 0.0 | 53.3 | 33.3 | 66.7 | +33.4 |
| P9 | Ökologie | Video | 40.0 | 60.0 | 66.7 | 60.0 | -6.7 |
| P10 | DNA-Replikation | Video | 26.7 | 46.7 | 13.3 | 86.7 | +73.4 |
| P10 | Mendel | Chat | 66.7 | 80.0 | 73.3 | 86.7 | +13.4 |
| P10 | Ökologie | VR | 73.3 | 93.3 | 93.3 | 93.3 | 0.0 |
| P11 | DNA-Replikation | Video | 53.3 | 80.0 | 86.7 | 93.3 | +6.6 |
| P11 | Mendel | Chat | 46.7 | 80.0 | 80.0 | 80.0 | 0.0 |
| P11 | Ökologie | VR | 73.3 | 100.0 | 80.0 | 100.0 | +20.0 |
| P12 | DNA-Replikation | Video | 20.0 | 53.3 | 46.7 | 33.3 | -13.4 |
| P12 | Mendel | Chat | 53.3 | 86.7 | 73.3 | 93.3 | +20.0 |
| P12 | Ökologie | VR | 60.0 | 66.7 | 86.7 | 80.0 | -6.7 |
| P13 | DNA-Replikation | VR | 66.7 | 66.7 | 100.0 | 100.0 | 0.0 |
| P13 | Mendel | Video | 66.7 | 73.3 | 86.7 | 93.3 | +6.6 |
| P13 | Ökologie | Chat | 100.0 | 86.7 | 100.0 | 93.3 | -6.7 |
| P14 | DNA-Replikation | VR | 80.0 | 93.3 | 93.3 | 93.3 | 0.0 |
| P14 | Mendel | Video | 66.7 | 93.3 | 100.0 | 100.0 | 0.0 |
| P14 | Ökologie | Chat | 66.7 | 86.7 | 86.7 | 80.0 | -6.7 |
| P15 | DNA-Replikation | VR | 33.3 | 53.3 | 33.3 | 66.7 | +33.4 |
| P15 | Mendel | Video | 46.7 | 60.0 | 73.3 | 80.0 | +6.7 |
| P15 | Ökologie | Chat | 60.0 | 80.0 | 80.0 | 80.0 | 0.0 |
| P16 | DNA-Replikation | Chat | 46.7 | 40.0 | 53.3 | 80.0 | +26.7 |
| P16 | Mendel | VR | 46.7 | 66.7 | 66.7 | 80.0 | +13.3 |
| P16 | Ökologie | Video | 86.7 | 80.0 | 86.7 | 100.0 | +13.3 |
| P17 | DNA-Replikation | Chat | 40.0 | 53.3 | 46.7 | 66.7 | +20.0 |
| P17 | Mendel | VR | 53.3 | 80.0 | 60.0 | 80.0 | +20.0 |
| P17 | Ökologie | Video | 80.0 | 66.7 | 80.0 | 73.3 | -6.7 |
| P18 | DNA-Replikation | Chat | 26.7 | 86.7 | 86.7 | 93.3 | +6.6 |
| P18 | Mendel | VR | 46.7 | 93.3 | 93.3 | 93.3 | 0.0 |
| P18 | Ökologie | Video | 80.0 | 93.3 | 93.3 | 93.3 | 0.0 |
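
The medium assignment in the table above is consistent with a cyclic 3×3 Latin square applied to blocks of three participants. The following sketch reproduces that assignment. The function name and the block-of-three rotation rule are inferred from the table rather than taken from the study protocol, and the order in which topics were presented is not modeled.

```python
TOPICS = ["DNA-Replikation", "Mendel", "Ökologie"]
MEDIUMS = ["Video", "Chat", "VR"]  # base row of the square (assignment for P1-P3)

def assigned_medium(participant: int, topic: str) -> str:
    """Return the tutoring medium for a participant (1-18) and topic.

    Assumption: participants rotate in blocks of three
    (P1-P3, P4-P6, P7-P9, then the pattern repeats).
    """
    block = ((participant - 1) // 3) % 3
    topic_index = TOPICS.index(topic)
    return MEDIUMS[(topic_index - block) % 3]

for p in (1, 4, 7):
    print(f"P{p}:", {t: assigned_medium(p, t) for t in TOPICS})
```

Under this rule every participant sees each medium exactly once, and each topic-medium pairing is covered by six participants.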
### Participant Summary
| Participant | N Tests | Avg Score % | Avg Confidence | Pre-Read | Post-Read | Pre-Tutor | Post-Tutor | Reading Gain | Tutoring Gain |
|:-----------:|:-------:|:-----------:|:--------------:|:--------:|:---------:|:---------:|:----------:|:------------:|:-------------:|
| P1 | 12 | 75.0 | 4.38 | 53.3 | 84.4 | 77.8 | 84.4 | +31.1 | +6.7 |
| P2 | 12 | 63.3 | 3.22 | 62.2 | 68.9 | 62.2 | 60.0 | +6.7 | -2.2 |
| P3 | 12 | 64.4 | 3.83 | 53.3 | 73.3 | 40.0 | 91.1 | +20.0 | +51.2 |
| P4 | 12 | 81.6 | 5.49 | 51.1 | 88.9 | 91.1 | 95.5 | +37.8 | +4.4 |
| P5 | 12 | 62.8 | 3.24 | 48.9 | 66.7 | 66.7 | 68.9 | +17.8 | +2.2 |
| P6 | 12 | 86.7 | 5.35 | 68.9 | 88.9 | 91.1 | 97.8 | +20.0 | +6.7 |
| P7 | 12 | 67.8 | 3.03 | 62.2 | 64.4 | 73.3 | 71.1 | +2.2 | -2.2 |
| P8 | 12 | 62.2 | 4.48 | 44.4 | 64.5 | 55.6 | 84.5 | +20.0 | +28.9 |
| P9 | 12 | 51.1 | 1.50 | 20.0 | 57.8 | 55.6 | 71.1 | +37.8 | +15.6 |
| P10 | 12 | 69.4 | 4.08 | 55.6 | 73.3 | 60.0 | 88.9 | +17.8 | +28.9 |
| P11 | 12 | 79.4 | 3.98 | 57.8 | 86.7 | 82.2 | 91.1 | +28.9 | +8.9 |
| P12 | 12 | 62.8 | 4.69 | 44.4 | 68.9 | 68.9 | 68.9 | +24.5 | 0.0 |
| P13 | 12 | 86.1 | 5.47 | 77.8 | 75.6 | 95.6 | 95.5 | -2.2 | 0.0 |
| P14 | 12 | 86.7 | 3.92 | 71.1 | 91.1 | 93.3 | 91.1 | +20.0 | -2.2 |
| P15 | 12 | 62.2 | 4.72 | 46.7 | 64.4 | 62.2 | 75.6 | +17.8 | +13.4 |
| P16 | 12 | 69.5 | 4.07 | 60.0 | 62.2 | 68.9 | 86.7 | +2.2 | +17.8 |
| P17 | 12 | 65.0 | 3.57 | 57.8 | 66.7 | 62.2 | 73.3 | +8.9 | +11.1 |
| P18 | 12 | 81.7 | 3.86 | 51.1 | 91.1 | 91.1 | 93.3 | +40.0 | +2.2 |
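
Each summary row appears to be the unweighted mean of that participant's three topic rows. The sketch below checks this for P3 using the values from the per-topic table; the averaging rule is an assumption that happens to reproduce the row, not a documented procedure.

```python
from statistics import mean

# P3's rows from the per-topic table: (topic, pre_tutoring, post_tutoring)
p3_rows = [
    ("DNA-Replikation", 53.3, 86.7),
    ("Mendel", 53.3, 86.7),
    ("Ökologie", 13.3, 100.0),
]

pre_tutor = mean(pre for _, pre, _ in p3_rows)
post_tutor = mean(post for _, _, post in p3_rows)
tutoring_gain = post_tutor - pre_tutor

print(f"Pre-Tutor {pre_tutor:.1f}, Post-Tutor {post_tutor:.1f}, gain {tutoring_gain:+.1f}")
```

The rounded values (40.0, 91.1, +51.2) match P3's summary row.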
### Scores by Medium (aggregate)
| Medium | Avg Score % | Avg Confidence | Tutoring Gain | Cohen's d |
|:------:|:-----------:|:--------------:|:-------------:|:---------:|
| Chat | 70.8 (SD=20.5) | 4.09 (SD=1.78) | +11.1 pp | d=0.65 |
| Video | 68.5 (SD=22.2) | 3.99 (SD=1.80) | +7.0 pp | d=0.36 |
| VR | 73.6 (SD=22.3) | 4.06 (SD=1.64) | +13.7 pp | d=0.62 |
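
The report does not say which variant of Cohen's d was used. As a hedged check, the sketch below computes the paired-samples form (mean gain divided by the sample SD of the gains) for the 18 VR rows of the per-topic table. That formula is an assumption, though it reproduces the reported VR values.

```python
from statistics import mean, stdev

# (pre_tutoring, post_tutoring) for the 18 VR rows of the per-topic table, P1 to P18
vr_pairs = [
    (100.0, 100.0), (86.7, 80.0), (13.3, 100.0),   # P1-P3, Ökologie
    (93.3, 100.0), (46.7, 53.3), (80.0, 100.0),    # P4-P6, DNA-Replikation
    (80.0, 80.0), (60.0, 80.0), (33.3, 66.7),      # P7-P9, Mendel
    (93.3, 93.3), (80.0, 100.0), (86.7, 80.0),     # P10-P12, Ökologie
    (100.0, 100.0), (93.3, 93.3), (33.3, 66.7),    # P13-P15, DNA-Replikation
    (66.7, 80.0), (60.0, 80.0), (93.3, 93.3),      # P16-P18, Mendel
]

gains = [post - pre for pre, post in vr_pairs]
mean_gain = mean(gains)
cohens_d = mean_gain / stdev(gains)  # stdev() is the sample SD (n - 1 denominator)

print(f"VR tutoring gain = {mean_gain:+.1f} pp, d = {cohens_d:.2f}")
```

This yields +13.7 pp and d = 0.62, in line with the VR row above.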
### Scores by Topic (aggregate)
| Topic | Avg Score % | Avg Confidence | Tutoring Gain |
|:-----:|:-----------:|:--------------:|:-------------:|
| Mendel | 71.8 (SD=18.4) | 4.15 (SD=1.75) | +11.1 pp (SD=10.7) |
| DNA-Replikation | 60.1 (SD=24.3) | 3.40 (SD=1.75) | +16.7 pp (SD=22.5) |
| Ökologie | 81.1 (SD=16.4) | 4.60 (SD=1.49) | +4.1 pp (SD=22.0) |
### Confidence by Medium (Tutoring Phase)
| Medium | Pre-Tutoring Conf | Post-Tutoring Conf | ΔConfidence |
|:------:|:-----------------:|:------------------:|:-----------:|
| Chat | 4.09 | 5.49 | +1.40 (d=1.20) |
| Video | 4.11 | 5.23 | +1.12 (d=1.19) |
| VR | 4.23 | 5.30 | +1.07 (d=1.09) |
---
## A. Overall Learning Trajectory
### A1 Overall Trajectory (Score + Confidence)
![Overall Learning Trajectory](Data/plots/A1_trajectory.png)
Scores rise from 54.8% (Pre-Reading) to 74.3% (Post-Reading), dip slightly to 72.1% at Pre-Tutoring (plausibly because of the time gap between sessions), then climb to 82.7% (Post-Tutoring). Confidence tracks this pattern closely, moving from 2.40 → 4.31 → 4.14 → 5.34 on the 1–7 scale. The reading phase accounts for the largest single jump (+19.5 pp), while tutoring adds another +10.6 pp.
### A2 Trajectory by Medium
![Trajectory by Medium](Data/plots/A2_trajectory_by_medium.png)
All three mediums show the same general upward trajectory. VR reaches the highest Post-Tutoring score (85.9%), followed by Chat (83.7%) and Video (78.5%). The Pre-Tutoring baselines are comparable (~72% for all three), so the medium differences emerge specifically during the tutoring phase.
### A3 Trajectory by Topic
![Trajectory by Topic](Data/plots/A3_trajectory_by_topic.png)
Ökologie starts highest (Pre-Reading 69.3%) and stays highest throughout, suggesting greater prior knowledge. DNA-Replikation starts lowest (42.6%) and shows the steepest climb, gaining +16.7 pp from tutoring alone. Mendel is intermediate. The topic-level differences highlight that DNA-Replikation has the most room for improvement, while Ökologie may suffer from ceiling effects.
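
One rough way to quantify the possible ceiling effect is to look at how much headroom the Ökologie Pre-Tutoring scores leave. The sketch below uses the 18 Ökologie values from the per-topic table. The "near ceiling" cutoff of 93.3% (at most one of 15 questions wrong) is an illustrative assumption, not a criterion used in the study.

```python
from statistics import mean

# Pre-Tutoring scores (%) for Ökologie, P1 to P18, from the per-topic table
oekologie_pre_tutoring = [
    100.0, 86.7, 13.3, 93.3, 86.7, 100.0,
    80.0, 80.0, 66.7, 93.3, 80.0, 86.7,
    100.0, 86.7, 80.0, 86.7, 80.0, 93.3,
]

NEAR_CEILING = 93.3  # assumed cutoff: at most one wrong answer out of 15

near_ceiling = sum(score >= NEAR_CEILING for score in oekologie_pre_tutoring)
mean_headroom = mean(100.0 - score for score in oekologie_pre_tutoring)

print(f"{near_ceiling} of {len(oekologie_pre_tutoring)} participants near ceiling")
print(f"mean headroom before tutoring: {mean_headroom:.1f} pp")
```

Six of the 18 participants had at most one question left to gain before tutoring started, and the mean headroom was about 17 pp.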
### A4 Participant-Level Heatmaps
![Participant Heatmaps](Data/plots/A4_heatmap.png)
Heatmaps of individual scores and confidence across all 4 timepoints. Notable patterns: P3 has a dramatic 13.3% → 100.0% swing for Ökologie during tutoring; P8 shows a large Chat tutoring gain (26.7% → 86.7% in DNA-Replikation); P9 reports consistently low confidence (avg 1.50) despite moderate scores.
---
## B. Tutoring Phase Deep-Dive
### B1 Paired Slopes by Medium (with Statistics)
![Paired Slopes by Medium](Data/plots/B1_tutoring_slopes_by_medium.png)
Individual Pre → Post-Tutoring score changes per participant, grouped by medium. Each line is one participant-topic pair. Chat and VR show more upward slopes; Video has the most mixed pattern. Paired t-test results and Cohen's d annotated on each panel.
### B2 Paired Slopes by Topic
![Paired Slopes by Topic](Data/plots/B2_tutoring_slopes_by_topic.png)
Same paired-slope view, now grouped by topic. DNA-Replikation shows the most dramatic improvements (many steep upward lines from low baselines), while Ökologie has flatter slopes due to already-high Pre-Tutoring scores.
### B3 Tutoring Gains by Medium (Effect Sizes)
![Tutoring Gains by Medium](Data/plots/B3_tutoring_gain_by_medium.png)
Bar charts with individual data points showing score gains (left) and confidence gains (right) by medium. VR leads with +13.7 pp (d=0.62), Chat follows at +11.1 pp (d=0.65), and Video lags at +7.0 pp (d=0.36). Confidence gains are large across all mediums (d > 1.0), with Chat showing the highest confidence boost (+1.40, d=1.20).
### B4 Medium × Topic Interaction
![Medium × Topic Interaction](Data/plots/B4_tutoring_medium_topic.png)
Tutoring gains broken down by both medium and topic. The interaction reveals that gains vary considerably across topic-medium combinations. DNA-Replikation benefits most from tutoring regardless of medium, while Ökologie gains are smallest (ceiling effect).
### B5 Tutoring Effectiveness Dashboard
![Tutoring Dashboard](Data/plots/B5_tutoring_dashboard.png)
Six-panel dashboard combining: Pre vs Post scores, gain distributions, medium comparison, score vs gain relationship, confidence change, and individual trajectories. Provides a comprehensive at-a-glance view of tutoring effectiveness.
---
## C. Start-to-Finish Gains
### C1 Pre-Reading to Post-Tutoring (Paired)
![Start-to-Finish Paired](Data/plots/C1_start_to_finish.png)
Each line connects a participant's Pre-Reading score to their Post-Tutoring score for each topic. The overall gain of +27.9 pp (t=10.32, p<.001) represents the full learning effect of reading + tutoring combined. Nearly all lines slope upward, demonstrating consistent learning across participants.
### C2 Learning Gains Overview
![Learning Gains Overview](Data/plots/C2_learning_gains.png)
Side-by-side comparison of reading gains vs tutoring gains across mediums and topics. The reading phase contributes more absolute score improvement on average (+19.5 pp) than the tutoring phase (+10.6 pp), but tutoring builds on already-higher baselines and adds further consolidation.
---
## D. Confidence Analysis
### D1 Confidence vs Test Score (Scatter)
![Confidence vs Score Scatter](Data/plots/D1_confidence_vs_score.png)
Strong positive correlation between test scores and average confidence ratings. Participants who score higher also report higher confidence. This holds across all timepoints, though the relationship is tightest at Post-Tutoring when both scores and confidence are highest.
### D2 Change in Confidence vs Change in Score
![Delta Confidence vs Delta Score](Data/plots/D2_delta_conf_vs_score.png)
During tutoring, score gains and confidence gains are positively correlated: participants who improved their scores also became more confident. However, some participants show large confidence increases even with modest score gains, suggesting that tutoring may boost metacognitive awareness beyond pure knowledge gains.
### D3 Confidence Calibration
![Confidence Calibration](Data/plots/D3_calibration.png)
Calibration analysis comparing actual performance to self-reported confidence. Participants tend to be slightly under-confident at Pre-Reading (low confidence, moderate scores) and approach better calibration by Post-Tutoring. This indicates that the full learning journey improves not just knowledge but also self-assessment accuracy.
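
The report does not describe how calibration was computed. A minimal sketch, assuming confidence is rescaled linearly from the 1–7 range to 0–100 and then compared with the mean score, uses the per-timepoint means from section A1. Under that assumed mapping the gap between score and rescaled confidence shrinks from about 31.5 pp at Pre-Reading to about 10.4 pp at Post-Tutoring.

```python
# Mean score (%) and mean confidence (1-7) per timepoint, taken from section A1
timepoints = {
    "Pre-Reading": (54.8, 2.40),
    "Post-Reading": (74.3, 4.31),
    "Pre-Tutoring": (72.1, 4.14),
    "Post-Tutoring": (82.7, 5.34),
}

def confidence_to_percent(confidence: float) -> float:
    """Linearly rescale a 1-7 rating to 0-100 (an assumed mapping)."""
    return (confidence - 1.0) / 6.0 * 100.0

# Positive gap = score is higher than the rescaled confidence (under-confidence)
calibration_gap = {
    name: score - confidence_to_percent(conf)
    for name, (score, conf) in timepoints.items()
}

for name, gap in calibration_gap.items():
    print(f"{name}: score exceeds rescaled confidence by {gap:.1f} pp")
```
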
---
## E. Personality Correlations
### E1 Big Five vs Tutoring Outcomes (Heatmap)
![Personality Correlation Heatmap](Data/plots/E1_personality_correlations.png)
Pearson correlation heatmap between Big Five personality traits and tutoring outcomes (score gain, confidence gain, post-tutoring score, post-tutoring confidence). Notable finding: Agreeableness shows a significant positive correlation with confidence gain (r≈0.60, p<.05), suggesting that more agreeable participants showed larger confidence boosts from tutoring.
### E2 Trait vs Tutoring Score Gain
![Trait vs Score Gain](Data/plots/E2_trait_vs_score_gain.png)
Scatter plots of each Big Five trait against tutoring score gain, with regression lines and correlation coefficients. Most personality traits show weak relationships with score gains, suggesting that tutoring effectiveness is relatively independent of personality in this sample. The strongest trend involves Agreeableness and confidence gain rather than score gain.
---
## F. Questionnaire Analysis
> Questionnaires were administered at multiple phases: Pre-Reading, Post-Reading, Pre-Tutoring, and Post-Tutoring.
> Instruments include: IMI (Intrinsic Motivation Inventory, 26 items), SUS (System Usability Scale, 10 items), UEQ-S (User Experience Questionnaire Short, 8 items), NASA-TLX (6 workload items), Godspeed (24 tutor impression items), Social Presence Legacy (5 items, VR-only), Cybersickness (5 binary items), IOS (Inclusion of Other in Self), plus stress/readiness/relaxation items and BFI-15 personality traits.
### Questionnaire Summary Statistics
#### SUS Scores by Medium (Tutoring Only, 0–100 scale)
| Medium | M | SD | Median | Interpretation |
|:------:|:---:|:----:|:------:|:--------------:|
| Chat | 81.2 | 18.0 | 83.8 | Good |
| Video | 76.8 | 15.6 | 80.0 | Above Average |
| VR | 75.4 | 20.2 | 78.8 | Above Average |
#### IMI Subscales by Medium (Tutoring, 1–7 scale)
| Subscale | Chat M (SD) | Video M (SD) | VR M (SD) |
|:---------|:----------:|:-----------:|:---------:|
| Interest/Enjoyment | 4.48 (1.39) | 3.86 (1.60) | 4.24 (1.37) |
| Value/Usefulness | 4.90 (1.68) | 4.48 (1.49) | 4.70 (1.43) |
| Perceived Choice | 5.64 (1.20) | 5.31 (1.31) | 5.60 (1.14) |
#### UEQ-S Overall by Medium (Tutoring, -3 to +3 scale)
| Medium | M | SD | Interpretation |
|:------:|:---:|:----:|:--------------:|
| Chat | 1.14 | 1.09 | Good (>0.8) |
| Video | 0.71 | 1.13 | Neutral |
| VR | 0.92 | 1.05 | Good (>0.8) |
#### NASA-TLX Overall Workload by Medium (Tutoring, 1–7 scale)
| Medium | M | SD |
|:------:|:---:|:----:|
| Chat | 3.36 | 0.82 |
| Video | 3.48 | 0.77 |
| VR | 3.40 | 0.99 |
#### Godspeed Tutor Impression by Medium (1–5 scale)
| Medium | M | SD |
|:------:|:---:|:----:|
| Chat | 3.23 | 0.54 |
| Video | 3.08 | 0.70 |
| VR | 3.15 | 0.50 |
#### Social Presence by Medium (1–5 scale, VR-only: N=17)
| Medium | M | SD |
|:------:|:---:|:----:|
| Chat | 2.10 | 0.14 |
| Video | — | — |
| VR | 3.01 | 0.89 |
#### IOS (Closeness to Tutor) by Medium (1–7 scale)
| Medium | M | SD |
|:------:|:---:|:----:|
| Chat | 1.88 | 0.81 |
| Video | 1.89 | 1.37 |
| VR | 2.00 | 1.50 |
---
### F1 IMI Subscales: Reading vs Tutoring by Medium
![IMI by Medium](Data/plots_questionnaires/Q01_imi_by_medium.png)
All three IMI subscales are higher during the tutoring phase than the reading phase. Chat consistently scores the highest across Interest/Enjoyment (M=4.48) and Value/Usefulness (M=4.90), followed closely by VR. Video scores lowest on Interest/Enjoyment (M=3.86). Perceived Choice is high across all mediums (>5.3), indicating participants felt autonomy regardless of the tutoring format.
### F2 System Usability Scale (SUS) by Medium
![SUS by Medium](Data/plots_questionnaires/Q02_sus_by_medium.png)
Chat achieves the highest usability score (M=81.2), crossing the "Good" threshold (>80). Video (M=76.8) and VR (M=75.4) are both above average (>68) but below the "Good" cutoff. The higher Chat SUS score likely reflects the familiarity and simplicity of text-based interaction compared to video or VR interfaces.
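
For reference, the standard SUS scoring rule turns ten 1–5 responses into a 0–100 score: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5. The sketch below applies that rule together with the thresholds quoted in this report. The function names, the "Below Average" label, and the example answers are illustrative and not taken from the study code or data.

```python
def sus_score(responses: list[int]) -> float:
    """Score one 10-item SUS questionnaire (each response 1-5) on the 0-100 scale."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:   # odd items are positively worded
            total += response - 1
        else:                      # even items are negatively worded
            total += 5 - response
    return total * 2.5

def interpret(score: float) -> str:
    """Labels follow the thresholds quoted in this report (>80 Good, >68 Above Average)."""
    if score > 80:
        return "Good"
    if score > 68:
        return "Above Average"
    return "Below Average"  # label for the remaining range is an assumption

example = [5, 2, 4, 1, 4, 2, 5, 1, 4, 2]  # hypothetical answers, not study data
print(sus_score(example), interpret(sus_score(example)))
```

Applied to the reported means, `interpret` returns "Good" for Chat (81.2) and "Above Average" for Video (76.8) and VR (75.4).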
### F3 UEQ-S: Pragmatic & Hedonic Quality
![UEQ-S by Medium](Data/plots_questionnaires/Q03_ueqs_by_medium.png)
UEQ-S scores are centered (-3 to +3), with >0.8 indicating "good" quality. Chat leads on both pragmatic (functional) and hedonic (enjoyment) quality during tutoring, while the reading phase shows similar scores across all mediums. All tutoring mediums achieve positive UEQ-S scores, confirming a generally positive user experience.
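
The interpretation column of the UEQ-S table can be reproduced with a simple threshold rule. The sketch below assumes raw items were recorded on a 1–7 scale and centered by subtracting 4, and that the lower cutoff of -0.8 mirrors the +0.8 threshold quoted above. Both points and the function names are assumptions rather than details from the study.

```python
def center_item(raw: int) -> int:
    """Shift a raw 1-7 item response to the centered -3..+3 range (assumed raw coding)."""
    return raw - 4

def ueq_label(mean_score: float) -> str:
    """Classify a centered mean; the -0.8 cutoff is assumed to mirror the +0.8 threshold."""
    if mean_score > 0.8:
        return "Good"
    if mean_score < -0.8:
        return "Bad"
    return "Neutral"

tutoring_means = {"Chat": 1.14, "Video": 0.71, "VR": 0.92}  # from the UEQ-S table
labels = {medium: ueq_label(m) for medium, m in tutoring_means.items()}
print(labels)
```
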
### F4 NASA-TLX Workload by Medium
![NASA-TLX by Medium](Data/plots_questionnaires/Q04_nasatlx_by_medium.png)
Workload subscale comparison across mediums during the tutoring phase. All three mediums have similar overall workload (~3.4/7). Notable differences: Video has the highest mental demand, VR has slightly higher physical demand (expected given headset use), and Chat has the highest temporal demand. Performance ratings (reversed: high = high workload) are comparable across mediums.
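
How the overall workload score was aggregated is not spelled out. A common choice is the unweighted mean of the six subscales ("raw TLX") after reversing the performance item so that high values mean high workload. The sketch below follows that convention on the 1–7 scale; the aggregation rule, the raw coding of the performance item, and the example ratings are all assumptions.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7

def overall_workload(ratings: dict[str, float]) -> float:
    """Unweighted mean of the six subscales after reversing 'performance'.

    Assumes 'performance' is recorded so that high = good performance and
    must be flipped (high = high workload) before averaging.
    """
    adjusted = dict(ratings)
    adjusted["performance"] = SCALE_MIN + SCALE_MAX - ratings["performance"]
    return mean(adjusted.values())

# Hypothetical ratings for one participant, not study data
example = {
    "mental": 5, "physical": 2, "temporal": 4,
    "performance": 6, "effort": 4, "frustration": 3,
}
print(f"overall workload = {overall_workload(example):.2f}")
```
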
### F5 NASA-TLX: Reading vs Tutoring Comparison
![NASA-TLX Comparison](Data/plots_questionnaires/Q05_nasatlx_comparison.png)
Left panel: Overall workload is slightly higher during reading than tutoring across all mediums, suggesting the tutoring phase felt less demanding than independent reading. Right panel: Subscale profiles by medium during tutoring show that Video has a distinctly higher mental demand peak, while VR's profile is slightly elevated on physical demand.
### F6 Godspeed Tutor Impression by Medium
![Godspeed by Medium](Data/plots_questionnaires/Q06_godspeed_by_medium.png)
Tutor impression (Godspeed) subscales are moderate across all mediums (around 3/5). Chat scores highest on Perceived Intelligence (showing participants found the chat tutor most "smart"), while VR leads slightly on Animacy. Likeability and Anthropomorphism are fairly similar across mediums. Perceived Safety is high across all conditions.
### F7 Social Presence by Medium
![Social Presence by Medium](Data/plots_questionnaires/Q07_social_presence_by_medium.png)
Social Presence was primarily measured for participants who wore the Meta Quest Pro (VR condition). VR produces substantially higher social presence (M=3.01) than Chat (M=2.10). The data for Video is unavailable (not applicable). This confirms that VR creates a stronger sense of co-presence with the virtual tutor.
### F8 Cybersickness Symptoms by Medium
![Cybersickness by Medium](Data/plots_questionnaires/Q08_cybersickness_by_medium.png)
Cybersickness items are binary (Yes/No). The most commonly reported symptoms across all mediums are difficulty concentrating and eye strain. VR shows slightly elevated rates on most symptoms compared to Chat and Video, which is expected given the headset-based nature of VR interaction.
### F9 Pre-Session States: Stress, Readiness, Relaxation
![Pre-Session States](Data/plots_questionnaires/Q09_pre_session_states.png)
Pre-session self-reports show that stress levels are low and comparable across all conditions and phases (Pre-Reading vs Pre-Tutoring). Readiness and relaxation are moderate-to-high. There are no significant differences between mediums in pre-session state, which suggests that the counterbalanced design kept mood and state from confounding the medium comparison.
### F10 Additional Measures: IOS, Self-Use, Helpfulness
![Additional Measures](Data/plots_questionnaires/Q10_additional_measures.png)
IOS scores are low across all mediums (~2/7), indicating participants did not feel particularly close to the tutoring agent. Self-reported willingness to use the tutoring method independently and perceived helpfulness are moderate, with Chat tending to score slightly higher on helpfulness.
### F11 Questionnaire Subscale Correlations & Learning Gain
![Correlation Heatmap](Data/plots_questionnaires/Q11_correlation_heatmap.png)
Key correlations (per-participant averages, Pearson r with significance):
- **IMI Interest ↔ UEQ-S**: r=0.83**, a strong link between enjoyment and user experience
- **IMI Interest ↔ Godspeed**: r=0.81**, participants who found the tutor more capable also enjoyed the session more
- **IMI Value ↔ Godspeed**: r=0.75**, perceived usefulness correlates with positive tutor impression
- **SUS ↔ IMI Choice**: r=0.69**, higher usability relates to greater perceived autonomy
- **NASA-TLX ↔ IMI Interest**: r=-0.48*, higher workload is associated with lower enjoyment
- **SUS ↔ Score Gain**: r=0.40, a moderate positive (non-significant) link between usability and learning gains
- Social Presence excluded from this analysis (VR-only, insufficient cross-medium data)
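
The significance markers in the list can be cross-checked by converting each r into a t statistic with n − 2 degrees of freedom. The sketch assumes n = 18 (one averaged value per participant) and the conventional reading of the markers (* for p<.05, ** for p<.01, two-tailed). Neither point is stated explicitly in the report, and the function names are illustrative.

```python
import math

N = 18             # assumption: one averaged value per participant
T_CRIT_05 = 2.120  # two-tailed critical t for alpha = .05, df = 16
T_CRIT_01 = 2.921  # two-tailed critical t for alpha = .01, df = 16

def t_from_r(r: float, n: int = N) -> float:
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def stars(r: float) -> str:
    """Significance marker for n = 18 (the critical values above are for df = 16)."""
    t = abs(t_from_r(r))
    if t > T_CRIT_01:
        return "**"
    if t > T_CRIT_05:
        return "*"
    return ""

for label, r in [("IMI Interest vs UEQ-S", 0.83),
                 ("NASA-TLX vs IMI Interest", -0.48),
                 ("SUS vs Score Gain", 0.40)]:
    print(f"{label}: r={r:+.2f}, t={t_from_r(r):+.2f} {stars(r)}")
```

With these assumptions the markers in the list are reproduced: r=0.83 and r=0.69 come out as `**`, r=-0.48 as `*`, and r=0.40 as non-significant.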
### F12 Reading vs Tutoring Phase Comparison Dashboard
![Phase Comparison Dashboard](Data/plots_questionnaires/Q12_phase_comparison_dashboard.png)
Four-panel dashboard comparing reading and tutoring phases across IMI subscales, NASA-TLX overall workload, UEQ-S overall quality, and pre-session stress. IMI subscales increase from reading to tutoring (participants found tutoring more engaging). Workload decreases slightly from reading to tutoring. UEQ-S shows divergence between mediums during tutoring (Chat highest, Video lowest). Pre-session stress remains stable.
### F13 VR-Specific Comparisons
![VR-Specific Analysis](Data/plots_questionnaires/Q13_vr_specific.png)
VR-specific panel comparing social presence, cybersickness, and Godspeed across mediums. VR achieves the highest social presence (M=3.01 vs Chat M=2.10), moderate cybersickness symptoms, and Godspeed impressions comparable to the other mediums. The elevated social presence in VR without a corresponding increase in Godspeed tutor impression suggests that VR enhances the sense of "being there" without necessarily changing how the tutor is perceived.
---
## Summary
- **Overall**: Participants improved by **+27.9 pp** from Pre-Reading to Post-Tutoring (54.8% → 82.7%).
- **Tutoring phase**: All three mediums produced positive learning gains. **VR** (+13.7 pp, d=0.62) and **Chat** (+11.1 pp, d=0.65) outperformed **Video** (+7.0 pp, d=0.36).
- **Confidence**: Tracked test scores closely. All mediums increased confidence during tutoring, with Chat producing the largest boost (+1.40 on a 7-point scale).
- **Topics**: DNA-Replikation showed the largest tutoring gains (+16.7 pp) from a low baseline, while Ökologie showed the smallest gains (+4.1 pp) likely due to ceiling effects.
- **Personality**: Agreeableness was the only Big Five trait significantly associated with tutoring outcomes (confidence gain, r≈0.60, p<.05).
- **Usability (SUS)**: Chat rated highest (M=81.2, "Good"), Video (M=76.8) and VR (M=75.4) above average.
- **Motivation (IMI)**: Tutoring phase rated higher than reading phase on all subscales. Chat scored highest on Interest/Enjoyment (M=4.48) and Value/Usefulness (M=4.90).
- **User Experience (UEQ-S)**: Chat achieved "Good" quality (M=1.14), VR borderline good (M=0.92), Video neutral (M=0.71).
- **Workload (NASA-TLX)**: Similar across all mediums (~3.4/7). Tutoring felt slightly less demanding than reading.
- **Tutor Impression (Godspeed)**: Moderate across all mediums (~3.1/5), with Chat slightly ahead on perceived intelligence.
- **Social Presence**: VR (M=3.01) substantially higher than Chat (M=2.10), confirming VR's advantage for co-presence.
- **Correlations**: IMI Interest strongly correlates with UEQ-S (r=0.83) and Godspeed (r=0.81). Higher workload negatively correlates with enjoyment (r=-0.48). SUS shows a moderate positive, non-significant link with learning gains (r=0.40).