Four Times More Accurate Than Doctors With 10 Years of Experience! Microsoft Releases Breakthrough Medical AI System

Microsoft CEO Satya Nadella announced on social media that Microsoft has released MAI-DxO, a groundbreaking medical AI system.

MAI-DxO's biggest technical innovation is its model-agnostic design, which lets it work with language models from different vendors and of different capability levels and generally improves their diagnostic performance. It also simulates the diagnostic process of a real doctor and achieves a higher accuracy rate than professional physicians.

According to test data released by Microsoft, in a comparison with 21 professional physicians, each with more than 10 years of medical experience, the human physicians achieved an average accuracy of only 19.9% on a hidden test set of 56 cases from the New England Journal of Medicine.

By contrast, in its no-budget configuration using OpenAI's o3 model, MAI-DxO reached an accuracy of up to 81.9%, and in ensemble mode up to 85.5%, more than four times the accuracy of the professional physicians, while costs also dropped significantly.

In addition, Microsoft has released a specialized medical sequential diagnostic benchmark, SDBench.

Medical diagnosis is a complex sequential process that requires physicians to collect patient information, formulate hypotheses, test them, and gradually refine the scope of the diagnosis.

In clinical practice, doctors ask a series of targeted questions based on the patient's initial signs and symptoms, learn more about the patient's medical history, lifestyle, and family history, and combine this information with the results of laboratory and imaging tests to gradually narrow the range of possible diseases and ultimately arrive at an accurate diagnosis.

MAI-DxO achieves its significant gains in diagnostic accuracy and cost-effectiveness primarily by simulating a panel of virtual doctors with different roles who work together to solve diagnostic problems.

A Brief Introduction to MAI-DxO

The core idea of the MAI-DxO framework is to draw on the way real healthcare teams collaborate: different "doctors" apply the strengths of their respective specialties and jointly drive the diagnostic process, while well-designed coordination mechanisms help avoid problems such as individual cognitive bias and over-testing.

First, Dr. Hypothesis maintains a probability-ranked differential diagnosis of the three most likely diseases and updates their probabilities in a Bayesian manner each time a new finding is obtained. This ensures that the diagnostic process always has a clear direction and that the diagnostic hypotheses are adjusted promptly as new information arrives, providing a basis for subsequent test selection and diagnostic decisions.
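
The Bayesian updating attributed to Dr. Hypothesis can be illustrated with a minimal sketch. The disease names, prior probabilities, and likelihood values below are invented for illustration, and `bayes_update` and `top_three` are hypothetical helpers; the article does not describe how MAI-DxO actually computes the update.

```python
def bayes_update(priors, likelihoods):
    """Return posterior probabilities from priors and P(finding | disease)."""
    unnormalized = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(unnormalized.values())
    return {d: p / total for d, p in unnormalized.items()}


def top_three(probabilities):
    """Keep the three most probable diagnoses, most likely first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:3]


# Hypothetical differential before a new finding arrives.
priors = {"disease_a": 0.5, "disease_b": 0.3, "disease_c": 0.15, "disease_d": 0.05}
# Hypothetical probability of observing the new finding under each disease.
likelihoods = {"disease_a": 0.1, "disease_b": 0.8, "disease_c": 0.4, "disease_d": 0.2}

posterior = bayes_update(priors, likelihoods)
differential = top_three(posterior)
for disease, probability in differential:
    print(f"{disease}: {probability:.3f}")
```

In this toy case a finding that is far more likely under `disease_b` moves it from second place to the top of the list, which is the kind of reordering the role is described as performing.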

Dr. Test-Chooser selects, in each round, up to three diagnostic tests that best discriminate between the leading diagnostic hypotheses. By choosing tests carefully, it aims to obtain the most valuable diagnostic clues at the lowest cost, which improves diagnostic efficiency and reduces unnecessary testing. This role keeps the whole diagnostic process focused on the relevance and cost-effectiveness of tests.
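
One common way to formalize "tests that best discriminate between hypotheses" is expected information gain. The sketch below uses that technique as a stand-in; the article does not say that MAI-DxO computes it this way. The test names, priors, and positive rates are invented, and `choose_tests` is a hypothetical helper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(priors, positive_rate):
    """Expected entropy reduction from a binary test.

    priors: {disease: P(disease)}
    positive_rate: {disease: P(test is positive | disease)}
    """
    gain = entropy(priors.values())
    for positive in (True, False):
        joint = {
            d: priors[d] * (positive_rate[d] if positive else 1 - positive_rate[d])
            for d in priors
        }
        p_outcome = sum(joint.values())
        if p_outcome > 0:
            posterior = [p / p_outcome for p in joint.values()]
            gain -= p_outcome * entropy(posterior)
    return gain


def choose_tests(priors, candidate_tests, max_tests=3):
    """Rank candidate tests by expected information gain and keep the best few."""
    ranked = sorted(
        candidate_tests,
        key=lambda name: expected_information_gain(priors, candidate_tests[name]),
        reverse=True,
    )
    return ranked[:max_tests]


priors = {"disease_a": 0.5, "disease_b": 0.3, "disease_c": 0.2}
candidate_tests = {
    "test_w": {"disease_a": 0.9, "disease_b": 0.1, "disease_c": 0.1},
    "test_x": {"disease_a": 0.5, "disease_b": 0.5, "disease_c": 0.5},  # uninformative
    "test_y": {"disease_a": 0.2, "disease_b": 0.9, "disease_c": 0.2},
    "test_z": {"disease_a": 0.6, "disease_b": 0.5, "disease_c": 0.4},
}
chosen = choose_tests(priors, candidate_tests)
print(chosen)
```

A test whose result is equally likely under every hypothesis, such as `test_x` here, has essentially zero expected gain and drops out of the selection.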

Dr. Challenger acts as a devil's advocate. It is responsible for identifying possible anchoring bias, pointing out evidence that contradicts the current leading diagnosis, and proposing tests that could falsify it. This challenging mindset helps break fixed patterns of thinking in the diagnostic process, prompting the team to examine its hypotheses from multiple perspectives and to avoid settling on one diagnosis too early at the expense of other possibilities.

Dr. Stewardship is committed to cost-conscious care: it advocates for cheaper alternatives that are diagnostically equivalent and vetoes tests that are low-yield and expensive. In this way, Dr. Stewardship ensures that the diagnostic process keeps costs under control and avoids wasting resources while still pursuing an accurate diagnosis, which brings the process closer to the cost-effectiveness principles of real medical settings.
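
The substitute-or-veto behavior described for Dr. Stewardship can be sketched as a simple rule. Everything below is an assumption made for illustration: the `ProposedTest` fields, the yield score, the thresholds, and the test names do not come from MAI-DxO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedTest:
    name: str
    cost_usd: float
    expected_yield: float  # hypothetical score between 0 and 1
    cheaper_equivalent: Optional[str] = None


def review(proposals, min_yield=0.2, high_cost_usd=1000.0):
    """Veto low-yield expensive tests; swap in cheaper equivalents where known."""
    approved, vetoed = [], []
    for test in proposals:
        if test.expected_yield < min_yield and test.cost_usd >= high_cost_usd:
            vetoed.append(test.name)
        elif test.cheaper_equivalent is not None:
            approved.append(test.cheaper_equivalent)
        else:
            approved.append(test.name)
    return approved, vetoed


proposals = [
    ProposedTest("test_a", 1500.0, 0.1),
    ProposedTest("test_b", 800.0, 0.6, cheaper_equivalent="test_b_lite"),
    ProposedTest("test_c", 50.0, 0.05),
]
approved, vetoed = review(proposals)
print(approved, vetoed)
```

Note that `test_c` is low-yield but cheap, so under this rule it is not vetoed; only tests that are both low-yield and expensive are rejected.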

Dr. Checklist is responsible for behind-the-scenes quality control, ensuring that the test names generated by the model are valid and that the team's reasoning stays internally consistent. This role helps minimize diagnostic errors caused by mistakes or logical inconsistencies and supports the stability and reliability of the whole diagnostic system.

To assess MAI-DxO's performance, Microsoft compared it comprehensively with professional human doctors. On the key metric of diagnostic accuracy, the 21 physicians who took part in the test achieved an average diagnostic accuracy of only 19.9%.

In every configuration, MAI-DxO showed diagnostic accuracy well above that of the human physicians. For example, the no-budget configuration reached a diagnostic accuracy of 81.9%, and the ensemble configuration raised it to 85.5%.

In terms of cost-effectiveness, human physicians spent an average of $2,963 per case in the test. They took an average of 11.8 minutes per case, asked 6.6 questions, and requested 7.2 tests. MAI-DxO's cost depended on its configuration: the no-budget configuration averaged $4,735 per case, which is higher than the physicians' average, while the budgeted configuration reduced the cost to $2,396, below the physicians' figure, and still maintained high diagnostic accuracy.
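
The headline comparison can be checked with simple arithmetic on the figures quoted in this article. The variable names are only for illustration.

```python
# Figures quoted in the article.
physician_accuracy = 19.9   # percent
no_budget_accuracy = 81.9   # percent
ensemble_accuracy = 85.5    # percent
physician_cost = 2963       # USD per case
no_budget_cost = 4735       # USD per case
budgeted_cost = 2396        # USD per case

# How many times more accurate than the physicians.
ratio_no_budget = no_budget_accuracy / physician_accuracy
ratio_ensemble = ensemble_accuracy / physician_accuracy

# Cost relative to the physicians' average, as a fraction.
budgeted_saving = (physician_cost - budgeted_cost) / physician_cost
no_budget_extra = (no_budget_cost - physician_cost) / physician_cost

print(f"no-budget accuracy ratio: {ratio_no_budget:.2f}x")
print(f"ensemble accuracy ratio:  {ratio_ensemble:.2f}x")
print(f"budgeted cost saving:     {budgeted_saving:.1%}")
print(f"no-budget extra cost:     {no_budget_extra:.1%}")
```

The accuracy ratios come out slightly above 4 in both cases, which matches the "more than four times" claim, while the cost advantage over the physicians applies to the budgeted configuration rather than the no-budget one.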

MAI-DxO's Five Integration Modes

To adapt MAI-DxO to different healthcare scenarios and to balance cost, diagnostic efficiency, and accuracy, MAI-DxO provides five integration modes.

The Instant Answer mode relies exclusively on the initial case summary to make a diagnosis, with no follow-up questions or tests. Its design is inspired by clinical situations that demand a rapid response, such as emergencies or remote areas with minimal resources, where physicians must make a quick initial judgment from limited information. Although its diagnostic accuracy is relatively low, it can provide a preliminary diagnostic direction from model knowledge in the shortest possible time, giving a starting point for further treatment.

In terms of technical implementation, this mode invokes the language model directly on the initial information and optimizes the prompt structure to make the most of the diagnostic clues available. Its advantages are extremely fast diagnosis and almost zero cost; its disadvantage is that it cannot handle complex cases. It suits special scenarios where time is critical and accuracy requirements are relatively low.

The Question Only mode, on the other hand, restricts the diagnostic tools strictly to questioning. No diagnostic tests may be ordered, and the cost consists only of the fixed fee for one doctor's consultation. This mode simulates how a primary care physician gathers diagnostic information through detailed history taking and emphasizes the fundamental role of the history in diagnosis.

At the technical level, this mode digs progressively deeper into patient information with a hierarchical questioning strategy: it starts with questions about the patient's general condition and chief complaint, then refines them, based on the answers, toward specific symptom characteristics, past history, family history, and so on. It is inexpensive and non-invasive for the patient, which makes it suitable as an initial screening tool, but its diagnostic ability is limited for diseases that require objective test evidence. The design makes full use of the language model's conversational comprehension, optimizing the order of questions to extract as much diagnostic information as possible from the patient's narrative, which is valuable in scenarios such as primary care and health consultation.

The Budgeted mode introduces a dynamic budget control mechanism that tracks cumulative diagnostic costs in real time through a separately orchestrated language model call. The team can see the estimated cost of each test and then decide whether to cancel it, which enables proactive cost management during the diagnostic process.

In terms of technical implementation, this mode first uses natural language processing to convert each test request into a standardized CPT code and then estimates its cost in real time from a pre-built cost database. When the accumulated cost approaches the preset budget, the system triggers a cost warning, prompting the Dr. Stewardship role to assess the value for money of each test more rigorously.
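
The cost tracking and warning behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: `BudgetTracker`, the 80% warning threshold, the placeholder codes, and the prices are all hypothetical, and in the described system the code lookup is done by a language model rather than a dictionary.

```python
class BudgetTracker:
    """Tracks cumulative spend against a preset budget (hypothetical helper)."""

    def __init__(self, budget_usd, warn_fraction=0.8):
        self.budget_usd = budget_usd
        self.warn_fraction = warn_fraction
        self.spent_usd = 0.0

    def would_exceed(self, estimated_cost):
        return self.spent_usd + estimated_cost > self.budget_usd

    def near_limit(self):
        return self.spent_usd >= self.warn_fraction * self.budget_usd

    def charge(self, estimated_cost):
        self.spent_usd += estimated_cost


def process_requests(tracker, requested_codes, price_table):
    """Show each estimated cost to the tracker and cancel tests that break the budget."""
    log = []
    for code in requested_codes:
        cost = price_table[code]
        if tracker.would_exceed(cost):
            log.append((code, "cancelled"))
            continue
        tracker.charge(cost)
        log.append((code, "warning" if tracker.near_limit() else "ordered"))
    return log


# Placeholder codes and prices, not real CPT codes or real fees.
price_table = {"code_1": 300.0, "code_2": 550.0, "code_3": 400.0}
tracker = BudgetTracker(budget_usd=1000.0)
log = process_requests(tracker, ["code_1", "code_2", "code_3"], price_table)
print(log, tracker.spent_usd)
```

In this toy run the second test pushes spending past the warning threshold, and the third is cancelled because it would exceed the budget.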

The No Budget mode is the full form of MAI-DxO, with no explicit cost tracking or budget constraints, so the diagnostic team can make decisions with the sole goal of maximizing diagnostic accuracy. It takes full advantage of the collaborative strengths of the virtual physician team, combining Dr. Hypothesis's Bayesian inference, Dr. Test-Chooser's assessment of information value, Dr. Challenger's bias detection, and other mechanisms to diagnose complex cases in depth.

In terms of technical implementation, this mode optimizes the collaboration between roles through reinforcement learning algorithms that continuously adjust the diagnostic strategy to improve accuracy. Its advantage is that it can handle the most complex and difficult cases and achieve the highest diagnostic accuracy, but it may incur relatively high diagnostic costs. It suits scenarios that demand very high diagnostic accuracy, such as specialist consultations in tertiary hospitals or rare disease diagnostic centers, where complex cases call for a thorough diagnostic workup without cost constraints.

The Ensemble mode further improves diagnostic accuracy by simulating multiple teams of physicians working in parallel. Each team runs the No Budget mode independently, and an additional panel then aggregates their diagnostic results. The technical core of this mode is building diverse diagnostic teams, each of which may use a different underlying model or parameter configuration and therefore produces differing diagnostic reasoning. In the aggregation phase, the system considers not only how consistent the diagnoses are across teams but also the strength of the supporting evidence for each diagnosis and the soundness of the reasoning behind it. In this way, the Ensemble mode effectively reduces the bias and errors that a single team might introduce and improves diagnostic accuracy further.
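
A minimal sketch of the aggregation step, assuming each team reports a diagnosis together with a numeric evidence-strength score. The scoring scheme (summing evidence per diagnosis) and the sample values are assumptions chosen for illustration; the article does not specify how the aggregating panel weighs agreement against evidence.

```python
from collections import defaultdict


def aggregate(panel_results):
    """Combine (diagnosis, evidence_strength) pairs from independent teams.

    Summing the scores rewards both agreement across teams and strong evidence.
    """
    scores = defaultdict(float)
    for diagnosis, evidence_strength in panel_results:
        scores[diagnosis] += evidence_strength
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Hypothetical output from three independent teams.
panel_results = [
    ("diagnosis_x", 0.6),
    ("diagnosis_y", 0.9),
    ("diagnosis_x", 0.5),
]
final_diagnosis, scores = aggregate(panel_results)
print(final_diagnosis, scores)
```

Here two teams agreeing on `diagnosis_x` with moderate evidence outweigh one team that supports `diagnosis_y` with strong evidence.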

The Sequential Diagnosis Benchmark SDBench

SDBench is an interactive evaluation framework carefully built by the Microsoft AI team. It transforms 304 challenging diagnostic cases from the New England Journal of Medicine Clinicopathological Conference (CPC) series into step-by-step interactive diagnostic scenarios. These cases cover a diverse range of clinical presentations, from common to rare diseases, and provide rich, authentic material for assessing the sequential diagnostic ability of a diagnostic subject, whether a human physician or an AI.

In SDBench, the diagnostic process begins with a short case summary, such as "A 29-year-old woman was admitted to the hospital with a sore throat and peritonsillar swelling and bleeding; her symptoms were not relieved by antimicrobial therapy."

Based on this initial information, the diagnostic subject decides which questions to ask the patient next, which tests to request, or whether it is ready to make a final diagnosis. The process is iterative: each time the diagnostic subject makes a request, an agent model called the "gatekeeper" responds.

The gatekeeper model is a specially designed language model that has the complete case documentation, including the final diagnosis, but will only provide information about the appropriate clinical findings based on explicit queries from the diagnostic subject, and will politely decline to answer a query if it is too vague or unspecific. This design simulates the process of obtaining patient information by a physician in a real clinical scenario and ensures that the diagnostic subject must gradually unravel the full picture of the case through reasonable, targeted questions and requests for examination.
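
The request-and-response loop can be sketched with a stub gatekeeper. In SDBench the gatekeeper is a language model; the dictionary lookup below is only a stand-in, and the class name, method names, query strings, and findings are all hypothetical.

```python
class Gatekeeper:
    """Stub gatekeeper: reveals a finding only when explicitly asked for it."""

    def __init__(self, findings, final_diagnosis):
        self._findings = findings
        # The gatekeeper knows the final diagnosis but never reveals it.
        self._final_diagnosis = final_diagnosis

    def respond(self, query):
        key = query.strip().lower()
        if key in self._findings:
            return self._findings[key]
        return "Please ask a more specific question or name a specific test."


def run_session(gatekeeper, queries):
    """Send each query in turn and record the gatekeeper's reply."""
    return [(query, gatekeeper.respond(query)) for query in queries]


gatekeeper = Gatekeeper(
    findings={
        "finding_a": "placeholder result for finding_a",
        "test_b": "placeholder result for test_b",
    },
    final_diagnosis="hidden_diagnosis",
)
transcript = run_session(gatekeeper, ["finding_a", "tell me everything", "Test_B"])
for query, reply in transcript:
    print(f"{query!r} -> {reply}")
```

The vague request is declined, while the two explicit requests return only the matching findings, so the full case is uncovered one targeted query at a time.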

To make the assessment more realistic, SDBench also introduces a "judge" agent to assess the accuracy of each diagnosis. Because different physicians may use different terminology for the same disease while their clinical management would be identical, the judge agent does not decide correctness from the literal wording alone. Instead, it evaluates the diagnosis along several dimensions, including the core disease entity, etiology, anatomical site, and specificity.

For example, for a diagnosis of "bacterial endocarditis", even if the diagnostic subject gives a more specific description such as "infective endocarditis caused by Staphylococcus aureus", the judge agent treats it as correct as long as the core diagnosis matches the actual case. Assessing diagnoses by their clinical substance reflects the diagnostic subject's real ability more accurately and avoids misjudgments caused by differences in terminology.

In addition to diagnostic accuracy, SDBench also takes the cost incurred in the diagnostic process as an important evaluation index. In real-life clinical practice, doctors need to consider the cost-effectiveness of tests when choosing them, and cannot arbitrarily perform expensive tests.

Therefore, SDBench sets a fixed cost for each interaction between the diagnostic subject and the patient. Requests for diagnostic tests are converted into standardized Current Procedural Terminology (CPT) codes by a language-model-based lookup system, and the corresponding costs are determined from the 2023 price list of a large U.S. health system. This kind of cost assessment not only pushes diagnostic subjects to pay more attention to cost control during diagnosis but also provides a standardized reference for comparing the cost-effectiveness of different diagnostic subjects.
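
The cost accounting described above amounts to a fixed fee per interaction plus a table lookup per test. The sketch below assumes a hypothetical fee, placeholder codes, and placeholder prices; none of these values come from SDBench, and `case_cost` is an illustrative helper.

```python
def case_cost(actions, price_table, interaction_fee_usd):
    """Total cost of a case: a fixed fee per question plus the price of each test.

    actions: list of (kind, detail) pairs, where kind is "question" or "test"
    and detail is the question text or the test's code.
    """
    total = 0.0
    for kind, detail in actions:
        if kind == "question":
            total += interaction_fee_usd
        elif kind == "test":
            total += price_table[detail]
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return total


# Placeholder codes and prices, not real CPT codes or real fees.
price_table = {"code_1": 250.0, "code_2": 75.5}
actions = [
    ("question", "first question"),
    ("question", "second question"),
    ("test", "code_1"),
    ("test", "code_2"),
]
total = case_cost(actions, price_table, interaction_fee_usd=100.0)
print(total)
```

Because every diagnostic subject is charged from the same table, totals computed this way are directly comparable across human physicians and AI systems.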

(Text: AIGC Open Community)

