https://arxiv.org/pdf/2303.08774.pdf
We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C. Exams were sourced from publicly available materials. Exam questions included both multiple-choice and free-response questions; we designed separate prompts for each format, and images were included in the input for questions that required them. The evaluation setup was designed based on performance on a validation set of exams, and we report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam. We estimate and report the percentile each overall score corresponds to. See Appendix A for further details on the exam evaluation methodology.
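The contamination-handling step above can be sketched in a few lines. This is a hypothetical illustration, not the report's actual code: the function name and inputs are assumptions, but the logic follows the described rule of scoring both the full exam and a contamination-removed variant, then reporting the lower score.

```python
# Hypothetical sketch of the reported-score rule described in the GPT-4
# technical report: score the full exam AND a variant with questions
# seen during training removed, then report the lower (more
# conservative) of the two overall scores.

def reported_score(full_exam_score: float, decontaminated_score: float) -> float:
    """Return the lower of the two exam scores, per the report's rule."""
    return min(full_exam_score, decontaminated_score)

# Example: full exam scores 88.0, the contamination-removed variant
# scores 85.5, so 85.5 is what gets reported.
print(reported_score(88.0, 85.5))  # -> 85.5
```

Taking the minimum guards against contaminated questions inflating the headline number, which is why the report describes the results as representative despite partial training-set overlap.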