
What is science, science of security?

Minimally, to count as scientific, we expect a theory to have the following properties:
  • Consistency: claims are consistent with other claims and available observations. Inconvenient observations are not discarded.
  • Falsifiability (see above): we can describe the evidence that would prove claims wrong. Without this we are not self-correcting [14].
  • Predictive power and progress: models and theories should facilitate accurate predictions, and the set of observations that can be accurately predicted should generally increase over time. We should not be asking all of the same questions year after year.


Scientific method. The idea is to attempt to generalize, making falsifiable statements that are consistent with what we have already observed but that also predict things not yet observed. Then seek new observations, especially those expected to present severe tests of predictions (rather than those expected to corroborate them). Often called the hypothetico-deductive model, the summary is:
1) Form hypotheses from what is observed.
2) Formulate falsifiable predictions from those hypotheses.
3) If new observations agree with the predictions, an hypothesis is supported (but not proved); if they disagree, it is rejected.

Note that this process of iteratively eliminating possibilities that conflict with observations is the essence of differential diagnosis in medicine, sensible approaches to car repair, and the investigative method Sherlock Holmes recommends to Watson: “Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.”  
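
The elimination process described above can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the car-repair hypotheses, the observation keys and the `eliminate` helper are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

# An observation is a set of named yes/no findings.
Observation = Dict[str, bool]
# A hypothesis is modelled as a check: is this observation consistent with it?
Hypothesis = Callable[[Observation], bool]


def eliminate(hypotheses: Dict[str, Hypothesis],
              observations: List[Observation]) -> List[str]:
    """Return the names of hypotheses that no observation contradicts."""
    surviving = []
    for name, consistent_with in hypotheses.items():
        if all(consistent_with(obs) for obs in observations):
            surviving.append(name)
    return surviving


# Hypothetical example: the car does not start.
hypotheses = {
    "dead battery": lambda o: not o["headlights_work"],
    "empty tank": lambda o: o["fuel_gauge_empty"],
    "faulty starter": lambda o: o["headlights_work"],
}
observations = [{"headlights_work": True, "fuel_gauge_empty": False}]

surviving = eliminate(hypotheses, observations)
print(surviving)
```

A hypothesis that survives is supported but not proved; a later observation may still eliminate it.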


While many open questions remain in the Philosophy of Science, much is also settled. Far less is settled in discussing Science of Security; what emerges from review of the security literature below is, in many places, an absence of consensus; we return later to consider if this signals an immature science. 
A. Science of Security: Early Search and Misunderstandings
Multics seniors still remind the young that today’s problems are not only 40 years old, but were better addressed by Multics.
He notes, on the challenge of using system models, that properties proven “may or may not hold for the real system depending on the accuracy of the model”, and that for high-level abstractions, “What is important is that the abstractions are done in such a way so that when we prove some property about the abstraction, then that property is true of the real, running system.”

B. Science of Security: Recent Efforts
Here we selectively review research since 2008 under the label “Science of Security”; our goal is not an encyclopedic review per se, but to provide context for later observations.
The desire to do security more scientifically fits into a larger picture of frustration voiced by experts—e.g., in accepting the Turing award in 2002, Shamir made a set of 10-year predictions that included “the non-crypto part of security will remain a mess.”  
JASON was requested by the DoD to examine the theory and practice of cyber-security, and evaluate
whether there are underlying fundamental principles that would make it possible to adopt a more scientific approach, identify what is needed in creating a science of cyber-security, and recommend specific ways in which scientific methods can be applied.
The science seems under-developed in reporting experimental results, and consequently in the ability
to use them. The research community does not seem to have developed a generally accepted way of
reporting empirical studies so that people could reproduce the work.



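Crypto (provable security): some people consider it a science and others do not, because crypto cannot guarantee the security of a real system; other aspects of the system's implementation also have to be considered.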

We now detail security research failures to adopt accepted lessons from the history and philosophy of science.

A. Failure to observe inductive-deductive split
Rather, the question is to what degree properties proven about a mathematical system can be translated into useful properties of a real-world one. If security is proved in the mathematical sense, then it can’t refer to a real-world property (e.g., the claim that a 128-bit key is more secure than a 64-bit key).

B. Reliance on unfalsifiable claims
For example, to falsify “in order to be secure you must do X” we would have to observe something
secure that doesn’t do X. If we interpret “secure” as a real-world property, such as the avoidance of future harm, then observing it requires knowing the future. On the other hand, if “secure” is interpreted formally, while we can now identify mathematically secure systems, we can make no deductions about real-world events (e.g., that harm will be avoided). A similar argument shows that claims of the form “X improves security” are unfalsifiable.
In summary, claims of necessary conditions for real-world security are unfalsifiable. 

C. Failure to bring theory into contact with observation
A scientific model is judged on the accuracy of its predictions (Section II-C1); lack of data or difficulty in making measurements does not justify trusting a model on the sole basis of its assumptions appearing reasonable. But this is often done in security research.

Community actions were based on the assumed truth of something that depended critically on an untested assumption.  

D. Failure to make claims and assumptions explicit
The evidence falsifying a precise claim is easily described. If a theory says “X should never happen under assumptions A, B and C” then showing that it does suffices to refute the claim. But when a statement is vague, or assumptions implicit, it is unclear what, if anything, is ruled out. Thus, difficulty articulating what evidence would falsify a claim suggests implicit assumptions or an imprecise theory [3].
The problem of implicit assumptions seems widespread (e.g., assuming that fewer users clicking on a link after a notification implies better security, or that raising awareness of cyber threats or paying more attention to warnings is inherently beneficial).

E. Failure to seek refutation rather than confirmation
The limitations of formal approaches noted in Section IV-A might lead to belief that empiricism wins: that measurement and experimentation are the clear way forward for pursuing security scientifically. The truth appears more complex. Recall that in the hypothetico-deductive model (Section II-E), hypotheses are most useful when they allow anticipation of as-yet unseen things, and observations are most useful when they present severe tests to existing hypotheses (vs. simply corroborating existing beliefs). If that model is not to be a random walk, observations must actively seek to refute existing belief (see Section II-D).

V. WAYS FORWARD: INSIGHTS AND DISCUSSION

T1: Pushes for “more science” in security, that rule nothing in or out, are too ambiguous to be effective. Many insights and methods from philosophy of science remain largely unexplored in security research.

Recalling Popper’s view that to count as scientific a statement has to “stick its neck out” and be exposed to risk, we suggest that the same is true of pursuing security scientifically: to be effective, calls for more science should specify desired attributes, specific sources of dis-satisfaction with current research, and preferred types of research. 


T2: Ignoring the sharp distinction between inductive and deductive statements is a consistent source of confusion in security.

T3: Unfalsifiable claims are common in security, and they, along with circular arguments, are used to justify many defensive measures in place of evidence of efficacy.

T4: Claims that unique aspects of security exempt it from practices ubiquitous elsewhere in science are unhelpful and divert attention from identifying scientific approaches that advance security research.  

T5: Physics-envy is counterproductive; seeking “laws of cybersecurity” similar to physics is likely to be a fruitless search  

T6: Crypto-envy is counterproductive; many areas of security, including those involving empirical research, are less amenable to formal treatment or mathematical role models.  

The main point is that despite many pointing to crypto as role-model for a Science of Security, its methods are less suitable for numerous areas, e.g., systems security and others involving empirical research.  

T7: Both theory and measurement are needed to make progress across the diverse set of problems in security research.  

T8: More security research of benefit to society may result if researchers give precise context on how their work fits into full solutions, to avoid naive claims of providing key components while major gaps mean full-stack solutions never emerge.

T9: Conflating unsupported assertions, and argument-by-authority, with evidence-supported statements, is an avoidable error especially costly in security.

“Before the underlying science is developed, engineers often invent rules of thumb and best practices that have proven useful, but may not always work.”  
In summary, scientific statements stand or fall on how they agree with evidence. Calling something a principle, best-practice, rule-of-thumb, or truism removes no burden of providing supporting evidence.

T10: Despite consensus that assumptions need be carefully detailed, undocumented and implicit assumptions are common in security research.

Connections between abstractions and the real world (Section II-C) are often unchecked or loose in security. Platt [26] recommends answering either “what experiment would disprove your hypothesis” or “what hypothesis does your experiment disprove.”

T11: Science prioritizes efforts at refutation. Empirical work that aims only to verify existing beliefs, but does not suggest new theory or disambiguate possibilities falls short of what science can deliver  

In science, there is an expectation to seek refuting observations, as discussed in Sections II-D, IV-E. Corroborating evidence is never definitive, whereas refuting evidence is.  

From the preceding points, some overall observations emerge related directly to Science. A first meta-observation is that the Security community is not learning from history lessons well known in other sciences. Security research is learning neither from other disciplines nor from its own literature, and questioning security foundations is not new.

A second meta-observation pertains to those seeing the end-goal of security research being to ultimately improve outcomes in the real world. The failure to validate the mapping of models
and assumptions onto environments and systems in the real world has resulted in losing the connections needed to meet this end-goal. A rigorous proof of security of a mathematical
system allows guarantees about a real-world system only if the coupling between them is equally rigorous. We have seen repeated failure in poor connections between mathematical systems and real-world ones, and consequent failure of the latter to enjoy properties promised by the former. 
That the Security community is experiencing problems historically well-known in other scientific fields is unsurprising—and perhaps even supports claims of being a Science. What is harder to accept is apparent unawareness or inability to better leverage such lessons.  We have noted the absence of consensus in many areas of Security, which some might take as signaling an immature field. 

On a positive note, one point of consensus is that security research is still in early days. Those who pursue a Science of Security should be cognizant of history—including that progress in science is neither steady nor straight-line. Simply wishing for a Science of Security will not make it happen.
What is needed is for security researchers to learn and adopt more scientific methodologies. Specific guidance on what those are, and training in recognizing and using them, may help security research become more scientific.




Active Cyber Defence (ACD)

Asset Mapping and ACD

Asset Mapping: a deep understanding of the cyber resources you are responsible for.


New concept developed by DARPA and US Air Force that is now being adapted within critical infrastructures and traditional public and private businesses.

The core concept is anticipating an attack against cyber assets and proactively preparing for and responding to it.

This talk focuses on asset identification and its relationship with all the other components of ACD.

Asset identification / network security monitoring
Incident response
Threat identification and manipulation
Threat intelligence consumption

Asset Mapping / Network security monitoring:
Detailed active and passive mapping of systems, networks and their "normal behavior" prior to an attack.
Detection of aberrant behavior (known signatures, heuristics, unusual changes or activity)

Perform detailed active and passive mapping of systems and networks: not a static snapshot (a photo) but dynamic behavior monitoring (a video). Detect abnormal behavior (known signatures, heuristics, unusual changes or activity) by comparing it with normal behavior. Challenges: different vendors and systems make monitoring and analysis hard; new features are added and systems change; and false positives have to be weeded out by identifying truly unique features and behaviors. Measure variance using statistical tools.
Use current technology, but focus on behavior rather than static data. Extract features from the monitored behavior, classify them, and build a knowledge base. Then train new techniques (for example, a neural network) to add to current methods, so that new defenses can be added earlier.
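
As a rough illustration of "compare against normal behavior using variance", the sketch below builds a baseline from past measurements and flags values that deviate strongly from it. The metric (connections per minute), the sample data and the 3-sigma threshold are all assumptions made for this example, not something stated in the talk.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise normal behaviour as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_aberrant(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly constant baseline: any change at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical baseline: outbound connections per minute for one asset.
normal_conns_per_min = [12, 15, 11, 14, 13, 12, 16, 14]
baseline = build_baseline(normal_conns_per_min)

print(is_aberrant(14, baseline))  # within normal variation
print(is_aberrant(60, baseline))  # far outside the baseline
```

A fixed threshold like this is the simplest possible choice; reducing false positives in practice would need per-asset baselines that are refreshed as systems change.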

Incident response:
Containment (measures to contain or curb the incident)
Eradication (destroy, root out)
Post Response

When an incident is identified by asset mapping/monitoring, dynamically change the system (restore it to a valid state, reconfigure it to defend, share threat information).

Threat identification and Manipulation (start to get offensive):
Examining attack methods
identifying objectives
gaming the adversary: implement deceptive tech; create darknets in the internal network (nothing and no one should ever touch those IPs); place honeypots, honeyfiles and sensors on the darknets
identifying key characteristics
assessing the sophistication of the attack and attacker (analysis the malware, to know its)
determining location of attackers
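
The darknet idea in the list above lends itself to a very small check: since no legitimate host should ever talk to a darknet address, any flow whose destination falls in a darknet range is worth an alert. The sketch below assumes hypothetical darknet ranges and a flow represented as a `(source, destination)` pair; neither comes from the talk.

```python
import ipaddress

# Hypothetical internal ranges set aside as darknets; no real host lives here.
DARKNETS = [
    ipaddress.ip_network("10.66.0.0/24"),
    ipaddress.ip_network("10.77.8.0/28"),
]


def touches_darknet(dst_ip: str) -> bool:
    """Return True if the destination address lies inside any darknet range."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in DARKNETS)


def darknet_alerts(flows):
    """Keep only the (source, destination) flows that touch a darknet."""
    return [(src, dst) for src, dst in flows if touches_darknet(dst)]


flows = [
    ("10.1.2.3", "10.1.2.9"),     # ordinary internal traffic
    ("10.1.2.44", "10.66.0.17"),  # touches the first darknet
    ("10.1.5.7", "10.77.8.3"),    # touches the second darknet
]
alerts = darknet_alerts(flows)
print(alerts)
```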

Threat intelligence consumption (a lot of money):
Interpret threat information:
Does this threat impact us? How? Do we want to change our system?

Why is asset mapping vital?
1. Without it, the response to an attack is hampered by the lack of knowledge of the environment (potentially hacking back into a third-party network).
2. Our ability to understand our adversary's methods or objectives would be imprecise and lethargic.
3. Our ability to employ deceptive tech (traps, decoys, etc.) in order to game the adversary would be quite different.
4. Our ability to assess and consume threat intelligence reports and determine if/how a report applies to our environment would be impaired.

Example: Heartbleed, 2014. (Many companies could not say whether they used OpenSSL.)
Applying ACD to Heartbleed-like incidents:
1. Significantly improve asset identification that includes much deeper knowledge regarding the operating characteristics and software installed on every device.
2. Improvement in mapping of identified threat characteristics and CVEs to ICS assets during the threat intelligence consumption phase
3. Expansion of Alternate Operation Centers and Test Environments to allow for rapid testing of patches and updates along with the ability to actively scan these environments to better assess the impacts of the vulnerabilities/exploits in question as an early step in Incident Response
4. Improve the ability to instrument network security monitoring capabilities with the latest threat data.
5. Improvement in the dissemination and specific details of threats like Heartbleed that provide Threat Identification details that fit into the ACDC cycle
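
Points 1 and 2 above amount to being able to answer "which of our assets run the affected software?" from the asset inventory. A minimal sketch follows, assuming a hypothetical inventory format (host name plus a name-to-version map) and made-up hosts. The vulnerable set reflects the commonly cited Heartbleed range, OpenSSL 1.0.1 through 1.0.1f, and ignores the affected 1.0.2 beta for brevity.

```python
# OpenSSL releases commonly cited as affected by Heartbleed (CVE-2014-0160).
HEARTBLEED_VERSIONS = {"1.0.1" + suffix for suffix in ["", "a", "b", "c", "d", "e", "f"]}

# Hypothetical asset inventory: every host lists its installed software.
inventory = [
    {"host": "hmi-01", "software": {"openssl": "1.0.1e", "nginx": "1.4.7"}},
    {"host": "historian-02", "software": {"openssl": "0.9.8y"}},
    {"host": "plc-gateway-03", "software": {"busybox": "1.22.1"}},
    {"host": "vpn-04", "software": {"openssl": "1.0.1g"}},
]


def affected_assets(inventory, package, vulnerable_versions):
    """Return hosts whose installed version of `package` is in the vulnerable set."""
    return [
        asset["host"]
        for asset in inventory
        if asset["software"].get(package) in vulnerable_versions
    ]


affected = affected_assets(inventory, "openssl", HEARTBLEED_VERSIONS)
print(affected)
```

The lookup is only as good as the inventory: a host with no recorded software silently counts as unaffected, which is exactly the gap that deeper asset identification is meant to close.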

Some examples of industrial control system(ICS) technologies are:
  • SCADA – Supervisory Control and Data Acquisition
  • DCS – Distributed Control System
  • PLC – Programmable Logic Controller

Hack-back disadvantages:
1. Illegal
2. Attribution (a mistake harms a third party)

vs deceptive tech.

Active vs passive mapping:

                 Active mapping                                  Passive mapping
Advantages       Immediate results                               Non-intrusive
                 Trusted tech                                    Provides a broader view of system and network behaviors
                 Deep results (quickly)                          Provides details that can be missed by active mapping

Disadvantages    Intrusive and can cause system crashes          Slow: can take days or weeks to construct an accurate picture
                 Only provides a snapshot of results             More difficult to develop detailed OS characteristics
                 Can be fooled by sophisticated malware & APTs   Requires access to massive amounts of traffic (large PCAPs)

Takeaways: key considerations and critical thinking:

1. Anticipate intrusions and adapt to them in real time

2. Continuous asset behavior mapping is vital; make sure passive mapping is a key component in your strategy

3. Integrate lures, traps and decoys into your defensive strategy

4. When consuming threat intelligence, make sure you understand the potential direct impact on your infrastructure

5. share key threat observations with peers and even competitors 

The “Triptych of Cyber Security”: A Classification of Active Cyber Defence  

Non-ACD measures are categorised as fortified cyber defence (FCD) and resilient cyber defence (RCD) rather than as passive cyber defence.

An examination of the cyber security strategies of national actors will demonstrate that active, fortified and resilient cyber defence are employed in a collaborative triptych of approaches to cyber security: three independent but related concepts coming together to achieve the single goal of operating in cyberspace free from the risk of physical or digital harm.

While “active defence” is common in the military as the idea of offensive action and counterattacks to deny advantage or position to the enemy [6], the concept remains elusive when applied to the cyber domain [7] and suffers a lack of clarity in related law and national policy [8].

“…the synchronized, real-time capability to discover, detect, analyze, and mitigate threats. [Active cyber defence] operates at network speed using sensors, software and intelligence to detect and stop malicious activity ideally before it can affect networks and systems.”

This definition identifies a number of features of ACD, the most important of which is the real-time detection and mitigation of key threats before damage occurs. Specific measures include the deployment of “white worms” [11], benign software similar to viruses but which seek out and destroy malicious software, identify intrusions [12] or engage in recovery procedures [13]. A second active defence tactic is to repeatedly change the target device’s identity during data transmission, a process known as address hopping [14]. This has the dual role of masking the target’s identifying characteristics as well as confusing the attacker [15]. Address hopping can serve as a useful action to counter espionage by masking the identities of devices where particular data is stored. Active cyber defence therefore places emphasis on proactive measures to counteract the immediate effects of a cyber-incident, either by identifying and neutralising malicious software or by deliberately seeking to mask the online presence of target devices to deter and counter espionage.
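
The text does not say how address hopping is implemented. One way to picture it, purely as an assumption for illustration, is that both endpoints derive the current address from a shared secret and a time slot, so that legitimate peers can follow the hops while an observer cannot predict them. The function name, the HMAC construction and the subnet below are all hypothetical.

```python
import hashlib
import hmac
import ipaddress


def hop_address(secret: bytes, time_slot: int, network: str) -> str:
    """Derive the address used during `time_slot` from a shared secret.

    Illustrative only: peers sharing `secret` compute the same address for
    the same slot; without the secret the sequence looks unpredictable.
    """
    net = ipaddress.ip_network(network)
    digest = hmac.new(secret, str(time_slot).encode(), hashlib.sha256).digest()
    # Skip the network and broadcast addresses of the subnet.
    usable = net.num_addresses - 2
    index = int.from_bytes(digest[:8], "big") % usable + 1
    return str(net[index])


secret = b"shared-secret-for-illustration"
network = "192.0.2.0/24"  # documentation range, used here as a stand-in
for slot in range(3):
    print(slot, hop_address(secret, slot, network))
```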

There are, however, a number of more aggressive measures which can be taken to defend systems and networks. While white worms can be used to seek out and combat malicious software, Curry and Heckman describe how they can also be used to turn the tools of hackers and would-be intruders against them and identify not just the attacking software, but the servers and other hardware devices hosting and distributing the attacking code [16]. This is a process known as “hack-back” [17]. Once the source devices of an intrusion or attack have been identified steps can be taken to render those devices inoperative or otherwise prevent them from carrying out their goals. What makes these measures significant is that they are aggressive, offensive techniques which operate beyond the boundaries of the defender’s network [18]. They are taking the fight to the attackers.

ACD is therefore a security paradigm employing two methods: one, the real-time identification and mitigation of threats in defenders’ networks; two, the capacity to take aggressive, external offensive countermeasures. For the purposes of establishing, or at least beginning the process of developing, a lexicon of cyber security terminology, ACD can therefore be described as:

" an approach to achieving cyber security predicated upon the deployment of measures to detect, analyse, identify and mitigate threats to and from communications systems and networks in real-time, combined with the capability and resources to take proactive or offensive action against threats and threat entities including action in those entities’ home networks.  "

4 issues:
- legal implications in the use of offensive external measures.
- Attribution cannot be 100% certain. (Utilising ACD as a policy or strategic choice must be considered carefully, given its inherent characteristic of action beyond the defender’s immediate network. Such risks raise a second problem when employing aggressive, extra-territorial measures: the accurate attribution of the initial incident given the anonymising capacity of cyberspace and its effects on accurately identifying perpetrators. The basic premise of the attribution problem is that one cannot know with 100% certainty that the identified origin location of a security breach is the true origin of that breach [32].)
- Content/analytics-based filters might be used by governments to filter media and conduct surveillance.
The concept of combatting threats outside the network or systems under attack therefore raises a number of significant concerns, not least the capacity for defending actors to respond with kinetic military force and the ramifications of doing so. However, the extra-territoriality inherent to ACD is vital to our understanding of the concept as a methodological approach to cyber security due to the fact that it is this aggressive external action which differentiates ACD from other approaches. Such a description raises a fourth issue around ACD and current efforts to define the concept: the assumption that all other, non-active forms of cyber defence are “passive” or reactive in nature.  (US military and UK)

A definition of FCD is also offered here: "constructing systemically secure communications and information networks in order to establish defensive perimeters around key assets and minimise intentional or unintentional incidents or damage."

While the defining characteristic of ACD is aggressive action taken outside the defender’s
home network, the defining characteristic of FCD is that approach’s preventive, introspective
focus. FCD measures seek to establish defensive perimeters through systems of firewalls and
antivirus software in order to minimise the chances of access to target systems and networks.

Resilience itself is predicated upon accepting that incidents will occur and focussing on the
ability to recover from those incidents [62], either returning to the original state or adapting to
generate a new, adjusted state.

RCD can therefore be defined as: "ensuring the continuity of system functionality and service provision by constructing communications and information networks with the systemic, inbuilt ability to withstand or adapt to intentional or unintentional incidents."

While ACD and FCD seek to identify threats and intrusions as soon as possible and deal
with them, RCD advocates sharing vital information regarding security breaches among all
interested parties and potential future victims [65]. (EU and Japan)

Resilience is a common trait in current cyber security policy documents. The strategies of the
European Union (EU) and Japan favour this approach. They concentrate on sharing information
between public and private bodies, harmonising public infrastructure security measures and
developing uniform standards of security [66] to ensure preparedness in the event of a natural or
malicious incident.
The defining characteristic of RCD is this idea of functional continuity.
The EU is currently considering legislation which would make it a legal requirement for all relevant public and private actors to share security breach information [69].

The result of this classification is the identification of not two modes of cyber defence (active or
passive), but three – active, fortified and resilient cyber defence. However the three paradigms
are not mutually exclusive. While very different given their varying techniques, each approach
operates in conjunction with the other to achieve a wider single goal, cyber security. By
concentrating not on the implementation of the measures themselves but their ultimate goals
these three paradigms together form a “Triptych of Cyber Security”: three parallel approaches
to achieving security when interacting with and utilising cyberspace.


Active cyber defence (ACD) is an approach to cyber security predicated upon proactive measures
to identify malicious codes and other threats, as well as aggressive external techniques designed
to neutralise threat agents. ACD is defined by the capacity and willingness to take action outside
the victim network [72]. Despite this, ACD is not mirrored by “passive cyber defence”. The
measures collated under this term should more accurately be classified as fortified and resilient
cyber defence. These terms clarify the nature of the action taken by focussing on the end goals
of the measures they describe.

The three types of cyber defence described here are not mutually exclusive. Instead they operate
in conjunction with one another in a triptych of measures further highlighting the inaccuracy
of a simple divide between active and passive approaches.
The goal of cyber security is to
enable operations in cyberspace free from the risk of physical or digital harm. To that end, the
three paradigms of defence postulated here work together to complement each other through
a range of measures designed to address specific issues around online security. Active cyber
defence focusses on identifying and neutralising threats and threat agents both inside and
outside the defender’s network, while fortified defence builds a protective environment. In
its turn resilience focusses on ensuring system continuity. The national strategies developed
over the last ten years demonstrate the complementarity of these three approaches. The US and
UK categorically adopt an active paradigm, whereby all available resources are deployed to
protect national interests, including proactively seeking out enemy actors and rendering them
ineffective. The US further retains the right to deploy the ultimate sanction of kinetic military
force in the event of a cyber-attack as a measure of last resort. However, neither the UK nor the
US are ignorant of the benefits of fortifying assets, or of making critical national infrastructures
resilient to the failures of the communications systems on which they rely [73]. For Germany the
policy of choice is FCD but network resilience is recognised in a commitment to protecting and
securing critical digital infrastructures due to their importance to physical social and economic
[74]. The EU and Japan adopt a resilience-based framework, yet both are seeking to
develop active defence capabilities.

 What this demonstrates is a conscious acknowledgement that one single approach to cyber
security is not enough. Active cyber defence, including all the measures that that concept entails,
is insufficient when seeking to achieve cyber security. Steps must be taken to fortify assets in
order to minimise the likelihood and effectiveness of cyber-incidents, as well as ensure system
and infrastructure continuity should an incident occur. Equally, FCD and RCD do not serve as
effective deterrents to would-be attackers. The willingness to identify and pursue threat agents
into their own home networks must be demonstrated alongside asset fortification and system
resilience. In short, the paradigms of cyber defence are not stand-alone approaches. Even for
those actors which place their strategies within an active framework, military or security agency
resources are not the only ones utilised. The consequence of this is the deployment of elements
of each approach simultaneously in a triptych of approaches intended to achieve a single goal

 By contextualising ACD as an approach which is used collaboratively with its fortified and
resilient cousins in a triptych of cyber security, and highlighting the crucial difference of
aggressive action beyond the victim network, it is possible to distil a definition of the term
“active cyber defence”. This is in spite of ACD being fraught with unresolved legal and
diplomatic difficulties. For the purposes of classification, a definition of active cyber defence
is proposed here:
" a method of achieving cyber security predicated upon the deployment of measures to detect, analyse, identify and mitigate threats to and from cyberspace in real-time, combined with the capability and resources to take proactive or aggressive action against threat agents in those
agents’ home networks."

The question of definition and classification in the cyber security debate will not be
resolved overnight. While active cyber defence is one feature of that debate, the definition
and classification offered here will go some way towards establishing a cohesive lexicon of
terminology, an exercise which will assist the development of legal and political solutions to
the complex issue of cyber security

Navy postgraduate MS thesis

Cyber Exploitation
This typology refers to the exploitation of computer systems involved in a cyber attack in order to obtain intelligence that can aid in the analysis of the attack and in determining attribution (e.g., hacking the compromised proxy computers one by one to retrace the attack path).

Counter Attack
In the case of a cyber attack, the equivalent would be to counter hack the attacker responsible for the cyber attack (e.g., hacking the C&C server of a botnet that is carrying out a DDoS attack).

Preemptive Strikes
In the context of cyberspace, a preemptive strike can be described as “conducting
an attack on a system or network in anticipation of that system or networking conducting
an attack on your system.” [68]
(E.g., 1) disrupt the potential attacker's network; 2) establish a presence in the potential attacker's network to deter the attack.)

Preventive Strikes
A preventive cyber attack can be launched against a
hostile actor (both state and non-state) to prevent the latter from acquiring any cyber
offensive capability (e.g., Stuxnet, aimed at preventing Iran from building a nuclear weapon).


Active cyber defenses may one day offer strategic advantages similar to those for active defenses in conventional warfare. They can help establish attribution of a cyber attack, deter attacks by creating the fear of retaliatory attacks in potential attackers, or even preempt an imminent attack. The typologies of active cyber defense are cyber exploitation, counter cyber attack, preemptive cyber attack, and preventive cyber attack. Cyber exploitation refers to the hacking of third-party or the attackers’ computers and networks for the purpose of gaining information about the attack, including the source of the attack, the methods and tools used, the scope of the attack, and data that may have been taken. Cyber exploitation, when carefully carried out, need not disrupt target computers and may even go undetected.

Counter cyber attack refers to the launching of a cyber attack against an attacker. The objective could be to interrupt an attack in progress and limit its effects. The counter attack could take the form of DOS attack or intrusion into the attacker’s network.

Alternatively, if the attacker is stealing sensitive documents, it could take the form of booby-trapping the documents with a remote-controlled Trojan, which then could be used to collect information about the attacker and/or shut down the attacker’s computer or network.

Preemptive cyber attack refers to the launching of a cyber attack against an adversary in anticipation of an imminent cyber attack. Preventive cyber attack refers to the launching of a cyber attack against an adversary based on a judgment that the adversary will be attacking.


Useful information from wiki

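From the host operating system's point of view, QEMU is an ordinary user process. QEMU allocates memory for the guest operating system inside its own virtual address space. From the guest operating system's point of view, that memory is its physical memory. This physical memory is divided into ordinary RAM and memory-mapped IO, and is registered through cpu_register_physical_memory_offset (exec.c). The device code under hw/* calls cpu_register_io_memory to register its IO emulation functions, followed by cpu_register_physical_memory.
The memory QEMU allocates to the guest is managed with RAMBlock and RAMList. Memory falls mainly into the following types (cpu-common.h):
  • IO_MEM_RAM: ordinary RAM.
  • IO_MEM_ROMD: treated as ROM on reads and as a device on writes.
  1. Space is requested through qemu_ram_alloc (exec.c).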
    ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size)
        RAMBlock *new_block, *block;
        // Each RAMBlock is given a string name.
        pstrcat(new_block->idstr, sizeof(new_block->idstr), name);
        // Check whether the RAMList already holds a RAMBlock with the same name.
        QLIST_FOREACH(block, &ram_list.blocks, next) {
        if (mem_path) {
        } else {
            // Points to the host virtual memory address.
            new_block->host = qemu_vmalloc(size);
        new_block->offset = find_ram_offset(size); // Offset of this RAMBlock within the RAMList.
        new_block->length = size; // Size of this RAMBlock.
        QLIST_INSERT_HEAD(&ram_list.blocks, new_block, next); // Add the new RAMBlock to the RAMList.
        ram_list.phys_dirty = qemu_realloc(ram_list.phys_dirty,
                                           last_ram_offset() >> TARGET_PAGE_BITS);
        memset(ram_list.phys_dirty + (new_block->offset >> TARGET_PAGE_BITS),
               0xff, size >> TARGET_PAGE_BITS);
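        return new_block->offset; // Return the offset of this RAMBlock within the RAMList.
  2. The RAMBlock's information is registered through cpu_register_physical_memory → cpu_register_physical_memory_offset (this registers guest physical memory with QEMU). target_phys_addr_t represents the guest physical address space; ram_addr_t represents the host virtual address space.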
    void cpu_register_physical_memory_offset(target_phys_addr_t start_addr,
                                             ram_addr_t size,
                                             ram_addr_t phys_offset,
                                             ram_addr_t region_offset)
        PhysPageDesc *p;
        for(addr = start_addr; addr != end_addr; addr += TARGET_PAGE_SIZE) {
            p = phys_page_find(addr >> TARGET_PAGE_BITS); // Look up l1_phys_map by guest physical address.
            if (p && p->phys_offset != IO_MEM_UNASSIGNED) {
                // l1_phys_map already has an entry for this guest physical address.
            } else {
                // Allocate a PhysPageDesc in l1_phys_map for this guest physical address and update its fields.
                p = phys_page_find_alloc(addr >> TARGET_PAGE_BITS, 1);
  3. Devices that use MMIO first call cpu_register_io_memory to register their IO emulation functions. cpu_register_io_memory returns an index into io_mem_write/io_mem_read. That index is passed as phys_offset to cpu_register_physical_memory.
    static void cirrus_init_common(CirrusVGAState * s, int device_id, int is_pci)
        s->vga.vga_io_memory = cpu_register_io_memory(cirrus_vga_mem_read,
                                                      cirrus_vga_mem_write, s);
        cpu_register_physical_memory(isa_mem_base + 0x000a0000, 0x20000,
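
The bookkeeping described in step 1 above (named blocks kept in a list, each identified by an offset) can be illustrated with a small stand-alone sketch. Everything here is hypothetical and simplified: ToyRamBlock, toy_ram_alloc and toy_ram_ptr are made-up names, and offsets are handed out by simply appending, which is an assumption of this sketch rather than a description of what find_ram_offset does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for RAMBlock / RAMList. */
typedef struct ToyRamBlock {
    char idstr[64];            /* block name */
    uint8_t *host;             /* host virtual address of the backing memory */
    size_t offset;             /* offset of this block within the toy list */
    size_t length;             /* size of the block in bytes */
    struct ToyRamBlock *next;
} ToyRamBlock;

static ToyRamBlock *toy_ram_list;  /* head of the block list */
static size_t toy_ram_end;         /* first unused offset (append-only) */

/* Allocate a named block and return its offset, or (size_t)-1 on failure
 * (out of memory, or a block with the same name already exists). */
static size_t toy_ram_alloc(const char *name, size_t size)
{
    for (ToyRamBlock *b = toy_ram_list; b != NULL; b = b->next) {
        if (strcmp(b->idstr, name) == 0) {
            return (size_t)-1;
        }
    }
    ToyRamBlock *nb = calloc(1, sizeof(*nb));
    if (nb == NULL) {
        return (size_t)-1;
    }
    nb->host = malloc(size);
    if (nb->host == NULL) {
        free(nb);
        return (size_t)-1;
    }
    strncpy(nb->idstr, name, sizeof(nb->idstr) - 1); /* calloc left the NUL */
    nb->offset = toy_ram_end;
    nb->length = size;
    toy_ram_end += size;
    nb->next = toy_ram_list;       /* insert at the head of the list */
    toy_ram_list = nb;
    return nb->offset;
}

/* Map a list offset back to a host pointer, or NULL if no block covers it. */
static uint8_t *toy_ram_ptr(size_t offset)
{
    for (ToyRamBlock *b = toy_ram_list; b != NULL; b = b->next) {
        if (offset >= b->offset && offset - b->offset < b->length) {
            return b->host + (offset - b->offset);
        }
    }
    return NULL;
}
```

The point of the sketch is only that callers hold an offset, not a host pointer, and that the host pointer is recovered by walking the block list.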


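PhysPageDesc describes the mapping between guest physical pages and host virtual pages. A two-level page table, l1_phys_map, stores the PhysPageDesc entries. phys_page_find_alloc looks up l1_phys_map by guest physical address to obtain the PhysPageDesc, allocating a new one if needed. (This emulates the TLB in a hardware MMU.)
  • cpu_register_physical_memory_offset → phys_page_find → phys_page_find_alloc.
    static PhysPageDesc *phys_page_find_alloc(target_phys_addr_t index, int alloc)
        // Get the level-1 page table entry.
        lp = l1_phys_map + ((index >> P_L1_SHIFT) & (P_L1_SIZE - 1));
        // Depending on alloc, allocate the level-2 page table entries.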
        for (i = P_L1_SHIFT / L2_BITS - 1; i > 0; i--) {
        pd = *lp;
        if (pd == NULL) {
            for (i = 0; i < L2_SIZE; i++) {
                pd[i].phys_offset = IO_MEM_UNASSIGNED;
                pd[i].region_offset = (index + i) << TARGET_PAGE_BITS;
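
The lookup-with-lazy-allocation idea behind l1_phys_map can be sketched in isolation. This is a minimal sketch under stated assumptions: a 32-bit address split into 10 + 10 + 12 bits, exactly two levels, and hypothetical names (ToyPhysPageDesc, toy_l1_map, toy_page_find_alloc); the excerpt above uses its own constants and a loop over intermediate levels that the sketch leaves out.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_BITS  12                      /* 4 KiB pages */
#define TOY_L2_BITS    10
#define TOY_L2_SIZE    (1u << TOY_L2_BITS)
#define TOY_L1_BITS    10
#define TOY_L1_SIZE    (1u << TOY_L1_BITS)
#define TOY_UNASSIGNED UINT32_MAX              /* marker for "no mapping yet" */

typedef struct ToyPhysPageDesc {
    uint32_t phys_offset;
} ToyPhysPageDesc;

/* Level 1 is always present; level-2 tables are allocated on demand. */
static ToyPhysPageDesc *toy_l1_map[TOY_L1_SIZE];

/* index is a page number, i.e. address >> TOY_PAGE_BITS.
 * Returns NULL if the level-2 table is missing and alloc is 0. */
static ToyPhysPageDesc *toy_page_find_alloc(uint32_t index, int alloc)
{
    ToyPhysPageDesc **lp =
        &toy_l1_map[(index >> TOY_L2_BITS) & (TOY_L1_SIZE - 1)];
    ToyPhysPageDesc *pd = *lp;

    if (pd == NULL) {
        if (!alloc) {
            return NULL;
        }
        pd = malloc(sizeof(*pd) * TOY_L2_SIZE);
        if (pd == NULL) {
            return NULL;
        }
        /* A fresh level-2 table starts with every page unassigned. */
        for (uint32_t i = 0; i < TOY_L2_SIZE; i++) {
            pd[i].phys_offset = TOY_UNASSIGNED;
        }
        *lp = pd;
    }
    return &pd[index & (TOY_L2_SIZE - 1)];
}
```

Calling it with alloc set to 0 gives a pure lookup; calling it with alloc set to 1 creates the missing level-2 table on the way.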


PageDesc maintains the relation between TBs and virtual pages / guest physical pages (depending on process or system mode). Likewise, a two-level page table, l1_map, stores the PageDesc entries. page_find_alloc looks up l1_map to obtain the PageDesc, allocating a new one if needed.
  • tb_find_slow → tb_gen_code → tb_link_page → tb_alloc_page → page_find_alloc.
    static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
        // ALLOC uses mmap in process mode and qemu_mallocz in system mode.
        /* Level 1.  Always allocated.  */
        lp = l1_map + ((index >> V_L1_SHIFT) & (V_L1_SIZE - 1));
        /* Level 2..N-1.  */
        for (i = V_L1_SHIFT / L2_BITS - 1; i > 0; i--) {
QEMU basically invalidates TBs one page at a time, dropping every TB that belongs to the page.
  • stl_mmu (softmmu_template.h) → io_writel (softmmu_template.h) → notdirty_mem_writel (exec.c) → tb_invalidate_phys_page_fast (exec.c) → tb_invalidate_phys_page_range (exec.c) → tb_phys_invalidate (exec.c) drops the TBs that belong to a given virtual page / guest physical page.
    void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end, ...)
        p = page_find(start >> TARGET_PAGE_BITS);
        /* we remove all the TBs in the range [start, end[ */
        tb = p->first_tb;
        while (tb != NULL) {
  • tb_invalidate_phys_page (exec.c) → tb_phys_invalidate (exec.c). tb_invalidate_phys_page is defined only in process mode and is used to handle SMC (self-modifying code).
    #if !defined(CONFIG_SOFTMMU)
    static void tb_invalidate_phys_page(tb_page_addr_t addr,
                                        unsigned long pc, void *puc)
        addr &= TARGET_PAGE_MASK;
        p = page_find(addr >> TARGET_PAGE_BITS);
        // Get the first tb of this page.
         // If the low two bits of tb are 01 (1), the guest binary covered by the tb crosses a page boundary.
         tb = p->first_tb;
        while (tb != NULL) {
            n = (long)tb & 3; // Get the block chaining direction.
            tb = (TranslationBlock *)((long)tb & ~3); // Strip the two low tag bits to recover the real tb.
            tb_phys_invalidate(tb, addr);
            tb = tb->page_next[n]; // Get the next tb in the page this tb belongs to (or in the next page).
        p->first_tb = NULL;
  • Eventually tb_phys_invalidate is called.
    void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
        // Remove this tb from tb_phys_hash.
         phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK); // The page-offset part of the virtual address is the same as in the physical address.
         h = tb_phys_hash_func(phys_pc);
        tb_remove(&tb_phys_hash[h], tb,
                  offsetof(TranslationBlock, phys_hash_next));
        // Remove tb from the corresponding PageDesc.
        if (tb->page_addr[0] != page_addr) {
            p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
            tb_page_remove(&p->first_tb, tb);
        if (tb->page_addr[1] != -1 && tb->page_addr[1] != page_addr) {
            p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
            tb_page_remove(&p->first_tb, tb);
        tb_invalidated_flag = 1;
        // Remove tb from tb_jmp_cache.
        h = tb_jmp_cache_hash_func(tb->pc);
        // Every env has its own tb_jmp_cache, so clear the entry in all of them.
        for(env = first_cpu; env != NULL; env = env->next_cpu) {
            if (env->tb_jmp_cache[h] == tb)
                env->tb_jmp_cache[h] = NULL;
        // Handle tb1 (tb -> tb1).
        tb_jmp_remove(tb, 0);
        tb_jmp_remove(tb, 1);
        // Handle tb1 (tb1 -> tb).
        tb1 = tb->jmp_first;
        for(;;) {
            n1 = (long)tb1 & 3;
            if (n1 == 2) // If the low two bits of tb1 are 10 (2), tb1 does not jump to another tb.
            tb1 = (TranslationBlock *)((long)tb1 & ~3); // Recover the original tb1.
            tb2 = tb1->jmp_next[n1]; // Handle tb2 (tb1 -> tb2).
            tb_reset_jump(tb1, n1); // Break the block chaining from tb1 to other tbs (code cache).
            tb1->jmp_next[n1] = NULL;
            tb1 = tb2;
        tb->jmp_first = (TranslationBlock *)((long)tb | 2); // Point jmp_first back at tb itself.
    • tb_jmp_remove removes this tb from the circular lists.
      static inline void tb_jmp_remove(TranslationBlock *tb, int n)
          ptb = &tb->jmp_next[n]; // n (0 or 1) indicates the direction of tb's next block chaining.
          tb1 = *ptb; // Handle tb1 (tb -> tb1).
          if (tb1) {
              /* find tb(n) in circular list */
              for(;;) {
                  tb1 = *ptb;
                  n1 = (long)tb1 & 3; // Extract the low two bits of tb1.
                  tb1 = (TranslationBlock *)((long)tb1 & ~3); // Recover the original tb1.
                  if (n1 == n && tb1 == tb) // tb does not jump to another tb.
                  if (n1 == 2) {
                      ptb = &tb1->jmp_first; // No other tb jumps to tb1.
                  } else {
                      ptb = &tb1->jmp_next[n1]; // Handle tb2 (tb1 -> tb2).
              /* now we can suppress tb(n) from the list */
              *ptb = tb->jmp_next[n];
              tb->jmp_next[n] = NULL;
    • cpu_exec (cpu-exec.c) uses tb_invalidated_flag.
      if (tb_invalidated_flag) {
          /* as some TB could have been invalidated because
             of memory exceptions while generating the code, we
             must recompute the hash index here */
          next_tb = 0;
          tb_invalidated_flag = 0;
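
The excerpts above repeatedly mask TB pointers with 3 and ~3: a small tag is stored in the low two bits of a pointer and stripped before the pointer is dereferenced. The following is a minimal sketch of that tagged-pointer technique only. ToyTB and the helper names are hypothetical, and the sketch assumes the structure is at least 4-byte aligned (true here because it contains pointers), which is what leaves the two low bits free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of a TB that sits on up to two page lists. */
typedef struct ToyTB {
    int id;
    struct ToyTB *page_next[2];   /* tagged links, one per page slot */
} ToyTB;

/* Store the tag n (0..3) in the two low bits of the pointer. */
static ToyTB *toy_tag(ToyTB *tb, unsigned n)
{
    assert(((uintptr_t)tb & 3u) == 0);   /* low bits must be free */
    return (ToyTB *)((uintptr_t)tb | (n & 3u));
}

/* Read the tag back. */
static unsigned toy_tag_of(ToyTB *tagged)
{
    return (unsigned)((uintptr_t)tagged & 3u);
}

/* Strip the tag to recover the real pointer. */
static ToyTB *toy_untag(ToyTB *tagged)
{
    return (ToyTB *)((uintptr_t)tagged & ~(uintptr_t)3);
}

/* Walk a page list: the tag on each link says which page_next slot of the
 * pointed-to TB continues the list. Returns the number of TBs visited. */
static int toy_count_page_list(ToyTB *first)
{
    int count = 0;
    ToyTB *tb = first;

    while (tb != NULL) {
        unsigned n = toy_tag_of(tb);
        tb = toy_untag(tb);
        count++;
        tb = tb->page_next[n];
    }
    return count;
}
```

The sketch uses uintptr_t for the pointer arithmetic; the excerpts cast through long, which serves the same purpose on the platforms they target.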

Find Fast - tb_find_fast (cpu-exec.c)
Find Slow - tb_find_slow (cpu-exec.c)
Build - tb_gen_code (exec.c)
Flush - tb_flush (exec.c)
Chain - tb_add_jump (exec-all.h)
Execute - tcg_qemu_tb_exec (tcg/tcg.h)
Invalidate - tb_phys_invalidate (exec.c)
Unchain - cpu_unlink_tb (exec.c)
Restore - cpu_restore_state (translate-all.c)

TCG (Tiny Code Generator)
gen_intermediate_code (target-i386/translate.c)
tcg_gen_code (tcg/tcg.c)
CC (Code Cache)
gen_opc_buf and gen_opparam_buf (translate-all.c)
static_code_gen_buffer (exec.c)
TBD (TB Descriptor)
TranslationBlock (exec-all.h)
TBDA (TB Descriptor Array)
TranslationBlock *tbs (exec.c)
TBHT (TB Hash Table)
TranslationBlock *tb_phys_hash (exec.c)
MPD (Memory Page Descriptor)
PageDesc (exec.c)


  • MemoryRegion (memory.h).
    struct MemoryRegion {
        /* All fields are private - violators will be prosecuted */
        const MemoryRegionOps *ops;
        void *opaque;
        MemoryRegion *parent;
        Int128 size;
        target_phys_addr_t addr;
        void (*destructor)(MemoryRegion *mr);
        ram_addr_t ram_addr;
        bool subpage;
        bool terminates;
        bool readable;
        bool ram;
        bool readonly; /* For RAM regions */
        bool enabled;
        bool rom_device;
        bool warning_printed; /* For reservations */
        MemoryRegion *alias;
        target_phys_addr_t alias_offset;
        unsigned priority;
        bool may_overlap;
        QTAILQ_HEAD(subregions, MemoryRegion) subregions;
        QTAILQ_ENTRY(MemoryRegion) subregions_link;
        QTAILQ_HEAD(coalesced_ranges, CoalescedMemoryRange) coalesced;
        const char *name;
        uint8_t dirty_log_mask;
        unsigned ioeventfd_nb;
        MemoryRegionIoeventfd *ioeventfds;

System Mode

Before QEMU 1.0

Taking qemu (i386-softmmu) before QEMU 1.0 as an example, the main flow is as follows:
main (vl.c) → init_clocks (qemu-timer.c) → module_call_init(MODULE_INIT_MACHINE) (module.c) → cpu_exec_init_all (initializes the dynamic translator) (exec.c) → module_call_init(MODULE_INIT_DEVICE) (module.c) → machine→init (initializes the machine) (vl.c) → main_loop (vl.c)
  • main_loop (vl.c) → qemu_main_loop_start (cpus.c) → cpu_exec_all (cpus.c) → main_loop_wait (vl.c)
    • cpu_exec_all (cpus.c) → qemu_clock_enable (qemu-timer.c) → qemu_alarm_pending (qemu-timer.c) → any_cpu_has_work (cpus.c)
    • cpu_exec_all (cpus.c) → qemu_cpu_exec (cpus.c) → cpu_x86_exec (cpu-exec.c) → tb_find_fast (cpu-exec.c) → tb_find_slow (cpu-exec.c)
      • tb_find_slow (cpu-exec.c) → get_page_addr_code (exec-all.h)
      • tb_find_slow (cpu-exec.c) → tb_gen_code (exec.c) → cpu_gen_code (translate-all.c) → gen_intermediate_code (target-i386/translate.c) → tcg_gen_code (tcg/tcg.c) → tcg_gen_code_common (tcg/tcg.c)
  • main_loop_wait (vl.c) handles events.
check exception -> check interrupt (setjmp) -> tb_find_fast -> tb_exec -> check exception (check interrupt)
main_loop_wait -> select (alarm)
QEMU sets up a timer (qemu_signal_init) that periodically raises SIGALRM to pull QEMU out of the code cache so that it can check for exceptions or interrupts.
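
The idea of a signal pulling execution out of a long-running loop can be sketched with a flag that the handler sets and the loop polls. This is a sketch under assumptions, not QEMU's code: toy_exit_request, toy_alarm_handler and toy_exec_loop are hypothetical names, and a plain counter stands in for running translated code.

```c
#include <signal.h>

/* Set from the signal handler, polled by the execution loop. */
static volatile sig_atomic_t toy_exit_request;

static void toy_alarm_handler(int signum)
{
    (void)signum;
    toy_exit_request = 1;   /* only async-signal-safe work in a handler */
}

/* Install the SIGALRM handler. Returns 0 on success, -1 on failure. */
static int toy_install_alarm_handler(void)
{
    return signal(SIGALRM, toy_alarm_handler) == SIG_ERR ? -1 : 0;
}

/* Stand-in for executing translated blocks: run up to max_steps steps,
 * but leave early as soon as an exit request is pending. The request is
 * consumed on the way out. Returns the number of steps executed. */
static long toy_exec_loop(long max_steps)
{
    long steps = 0;

    while (steps < max_steps && !toy_exit_request) {
        steps++;            /* one "translated block" */
    }
    toy_exit_request = 0;
    return steps;
}
```

In a fuller setup a periodic timer (for example one armed with setitimer) would deliver SIGALRM; calling raise(SIGALRM) by hand is enough to see the loop return early.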
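  1. The entry point is main (vl.c), which initializes the environment.
    int main(int argc, char **argv, char **envp)
        // QEMU maintains three clocks internally: rt_clock, vm_clock, and host_clock.
        // rtc_clock is later set to one of these three according to the command-line arguments.
        // module_call_init -> pc_machine_init -> qemu_register_machine
        // A default QEMUMachine exists; it can be replaced later while the command-line arguments are processed.
        /* Process the command-line arguments and initialize the environment. */
        // Initialize the locks QEMU uses and the signal numbers it relies on.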
        if (qemu_init_main_loop()) {
            fprintf(stderr, "qemu_init_main_loop failed\n");
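        // The alarm_timers array holds, for each kind of timer, the start/stop function pointers and other data.
        // init_timer_alarm calls the start function of each timer in the alarm_timers array in turn.
        // dynticks_start_timer registers the signal handler for SIGALRM.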
        if (init_timer_alarm() < 0) {
            fprintf(stderr, "could not initialize alarm timer\n");
        /* init the dynamic translator */
        cpu_exec_init_all(tb_size * 1024 * 1024);
        // drive_init_func eventually calls paio_init, which registers the signal handler for SIGUSR2.
        if (qemu_opts_foreach(&qemu_drive_opts, drive_init_func, &machine->use_scsi, 1) != 0)
        // Initialize the devices.
        // Create the QEMUMachine (hw/pc_piix.c) and call machine->init (pc_init_pci) to initialize it.
        machine->init(ram_size, boot_devices,
                      kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
        /* Initialize the remaining devices and the output devices. */
        main_loop(); // Main execution loop.
        return 0;
    • The SIGALRM handler registered by dynticks_start_timer is host_alarm_handler. When the host operating system delivers SIGALRM, host_alarm_handler calls qemu_notify_event if appropriate. qemu_notify_event uses cpu_exit to pull QEMU out of the current code cache so that it can check IO. For more on clocks, see [Qemu-devel] Question on kvm_clock working ...
    • The code of cpu_exec_init_all is as follows:
      /* Must be called before using the QEMU cpus. 'tb_size' is the size
         (in bytes) allocated to the translation buffer. Zero means default
         size. */
      void cpu_exec_init_all(unsigned long tb_size)
          code_gen_ptr = code_gen_buffer;
      #if !defined(CONFIG_USER_ONLY)
          io_mem_init(); // Register the MMIO callback functions.
      #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
          /* There's no guest base to take into account, so go ahead and
             initialize the prologue now.  */
    • pc_init_pci (hw/pc_piix.c) calls pc_init1 (hw/pc_piix.c) to initialize the PC machine.
      /* PC hardware initialisation */
      static void pc_init1(ram_addr_t ram_size, ...)
          // Call pc_new_cpu (hw/pc.c) -> cpu_init/cpu_x86_init (target-i386/helper.c) to initialize the CPU.
          // Allocate guest memory and load the BIOS.
          // This part is rewritten with the memory API in QEMU 1.0.
          // http://lists.gnu.org/archive/html/qemu-devel/2011-07/msg02716.html
          pc_memory_init(ram_size, kernel_filename, kernel_cmdline, initrd_filename,
                         &below_4g_mem_size, &above_4g_mem_size);
          // Call qemu_allocate_irqs (hw/irq.c) to set up the interrupt handlers.
          cpu_irq = pc_allocate_cpu_irq();
          pc_vga_init(pci_enabled? pci_bus: NULL);
          /* init basic PC hardware */
          pc_basic_device_init(isa_irq, &floppy_controller, &rtc_state);
      • Features/RamAPI
        void pc_memory_init(ram_addr_t ram_size, ...)
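            // Request memory from QEMU through qemu_ram_alloc. QEMU allocates memory in units of RAMBlock and manages all RAMBlocks with the RAMList.
            // Depending on the command-line arguments, QEMU backs the allocation with a file or requests it from the host operating system (posix_memalign).
            // The return value is the offset of the RAMBlock within the RAMList.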
            ram_addr = qemu_ram_alloc(NULL, "pc.ram",
                                      below_4g_mem_size + above_4g_mem_size);
            // Every kind of RAM (ordinary RAM, memory-mapped IO) must be registered with QEMU through cpu_register_physical_memory.
            // The information is recorded in a PhysPageDesc.
            cpu_register_physical_memory(0, 0xa0000, ram_addr);
                         below_4g_mem_size - 0x100000,
                         ram_addr + 0x100000);
  2. main_loop (vl.c) is the main execution loop.
    static void main_loop(void)
        // Has no effect if the IO thread is not enabled.
        // The main infinite loop.
        for (;;) {
            do {
                bool nonblocking = false;
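                nonblocking = cpu_exec_all(); // Translate and execute guest code.
                main_loop_wait(nonblocking); // Handle IO.
            } while (vm_can_run()); // Keep running as long as the virtual machine has not received a shutdown, reboot, or similar request.
            /* Check whether the system has received a shutdown or reboot request. On shutdown, leave this infinite loop. */
        bdrv_close_all(); // Close all devices.
        pause_all_vcpus(); // Currently has no effect.
  3. Translating and executing guest code is the job of cpu_exec_all (cpus.c).
    bool cpu_exec_all(void)
        // Visit each virtual CPU in turn.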
        for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
            CPUState *env = next_cpu;
                              (env->singlestep_enabled & SSTEP_NOTIMER) == 0);
            if (qemu_alarm_pending())
            if (cpu_can_run(env)) {
                // qemu_cpu_exec follows the same path as process mode.
                // cpu_x86_exec (cpu-exec.c) → tb_find_fast (cpu-exec.c) → tb_find_slow (cpu-exec.c)
                // When cpu_exec finishes it returns the exception_index status; the status values are defined in cpu-defs.h.
                if (qemu_cpu_exec(env) == EXCP_DEBUG) {
            } else if (env->stop) {
        exit_request = 0;
        return any_cpu_has_work();
    • qemu_cpu_exec basically only does some extra accounting.
  4. IO is handled by main_loop_wait (vl.c). See also: How to use the select(), an I/O Multiplexer
    void main_loop_wait(int nonblocking)
        nfds = -1;
        QLIST_FOREACH(ioh, &io_handlers, next) {
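          // Add the devices to be handled to the file descriptor sets above.
        // Compute the select timeout depending on whether nonblocking is set.
        tv.tv_sec = timeout / 1000;
        tv.tv_usec = (timeout % 1000) * 1000;
        // Devices are handled as file descriptors.
        // Use select to pick, from the device descriptors, a device that can be handled immediately.
        // The arguments to select are, in order: the number of descriptors to check (highest descriptor plus one), the set of file descriptors for input devices, the set of file descriptors for output devices,
        // the set of file descriptors for devices with exceptional conditions, and how long select should wait.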
        ret = select(nfds + 1, &rfds, &wfds, &xfds, &tv);
        if (ret > 0) {
            IOHandlerRecord *pioh;
            QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
              /* Handle the device. */
        /* Check bottom-halves last in case any of the earlier events triggered
           them.  */

After QEMU 1.0

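QEMU 1.0 enables the IO thread, and it cannot be turned off. Again taking qemu-system-i386 as an example:
Emulation of the virtual CPU and of the virtual peripherals is split across different threads. At boot you see at least two threads: the main thread handles IO, and the other emulates the virtual CPU. The flow for emulating the guest CPU is as follows:
  1. When cpu_init/cpu_x86_init (target-i386/helper.c) initializes a virtual CPU, it calls qemu_init_vcpu.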
    CPUX86State *cpu_x86_init(const char *cpu_model)
        CPUX86State *env;
        static int inited;
        env = g_malloc0(sizeof(CPUX86State));
        env->cpu_model_str = cpu_model;
        /* init various static tables used in TCG mode */
        if (tcg_enabled() && !inited) {
            inited = 1;
    #ifndef CONFIG_USER_ONLY
            prev_debug_excp_handler =
        if (cpu_x86_register(env, cpu_model) < 0) {
            return NULL;
        env->cpuid_apic_id = env->cpu_index;
        return env;
  2. qemu_init_vcpu (cpus.c)
    void qemu_init_vcpu(void *_env)
        CPUState *env = _env;
        env->nr_cores = smp_cores;
        env->nr_threads = smp_threads;
        env->stopped = 1;
        if (kvm_enabled()) {
        } else {
  3. qemu_tcg_init_vcpu (cpus.c)
    static void qemu_tcg_init_vcpu(void *_env)
        CPUState *env = _env;
        /* share a single thread for all cpus with TCG */
        if (!tcg_cpu_thread) {
            env->thread = g_malloc0(sizeof(QemuThread));
            env->halt_cond = g_malloc0(sizeof(QemuCond));
            tcg_halt_cond = env->halt_cond;
            qemu_thread_create(env->thread, qemu_tcg_cpu_thread_fn, env,
    #ifdef _WIN32
            env->hThread = qemu_thread_get_handle(env->thread);
            while (env->created == 0) {
                qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
            tcg_cpu_thread = env->thread;
        } else {
            env->thread = tcg_cpu_thread;
            env->halt_cond = tcg_halt_cond;
  4. qemu_tcg_cpu_thread_fn (cpus.c)
    static void *qemu_tcg_cpu_thread_fn(void *arg)
        CPUState *env = arg;
        /* signal CPU creation */
        for (env = first_cpu; env != NULL; env = env->next_cpu) {
            env->thread_id = qemu_get_thread_id();
            env->created = 1;
        /* wait for initial kick-off after machine start */
        while (first_cpu->stopped) {
            qemu_cond_wait(tcg_halt_cond, &qemu_global_mutex);
        while (1) {
            if (use_icount && qemu_clock_deadline(vm_clock) <= 0) {
        return NULL;
  5. tcg_exec_all (cpus.c) runs all the virtual CPUs.
    static void tcg_exec_all(void)
        int r;
        /* Account partial waits to the vm_clock.  */
        if (next_cpu == NULL) {
            next_cpu = first_cpu;
        for (; next_cpu != NULL && !exit_request; next_cpu = next_cpu->next_cpu) {
            CPUState *env = next_cpu;
                              (env->singlestep_enabled & SSTEP_NOTIMER) == 0);
            if (cpu_can_run(env)) {
                r = tcg_cpu_exec(env);
                if (r == EXCP_DEBUG) {
            } else if (env->stop || env->stopped) {
        exit_request = 0;
    • qemu_tcg_cpu_thread_fn (cpus.c) → tcg_exec_all (cpus.c) → tcg_cpu_exec (cpus.c) → cpu_x86_exec (cpu-exec.c)
At present the QEMU main thread itself acts as the IO thread and runs main_loop_wait; when it encounters block IO, it spawns a posix-aio-compat.c worker thread to handle it.
  1. main (vl.c)
        /* open the virtual block devices */
        if (snapshot)
            qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, NULL, 0);
        if (qemu_opts_foreach(qemu_find_opts("drive"), drive_init_func, &machine->use_scsi, 1) != 0)
        // qemu_init_main_loop calls main_loop_init (main-loop.c).
        if (qemu_init_main_loop()) {
            fprintf(stderr, "qemu_init_main_loop failed\n");
    • qemu_init_cpu_loop (cpus.c)
      void qemu_init_cpu_loop(void)
    • main_loop_init (main-loop.c)
      int main_loop_init(void)
          int ret;
          ret = qemu_signal_init();
          if (ret) {
              return ret;
          /* Note eventfd must be drained before signalfd handlers run */
          ret = qemu_event_init();
          if (ret) {
              return ret;
          return 0;
  2. main_loop (vl.c) is the main execution loop, running as the IO thread.
    static void main_loop(void)
        bool nonblocking;
        int last_io = 0;
        do {
            nonblocking = !kvm_enabled() && last_io > 0;
            last_io = main_loop_wait(nonblocking);
        } while (!main_loop_should_exit());
  3. main_loop_wait (main-loop.c)
    int main_loop_wait(int nonblocking)
        fd_set rfds, wfds, xfds;
        int ret, nfds;
        struct timeval tv;
        int timeout;
        if (nonblocking) {
            timeout = 0;
        } else {
            timeout = qemu_calculate_timeout();
        tv.tv_sec = timeout / 1000;
        tv.tv_usec = (timeout % 1000) * 1000;
        /* poll any events */
        /* XXX: separate device handlers from system ones */
        nfds = -1;
    #ifdef CONFIG_SLIRP
        slirp_select_fill(&nfds, &rfds, &wfds, &xfds);
        qemu_iohandler_fill(&nfds, &rfds, &wfds, &xfds);
        glib_select_fill(&nfds, &rfds, &wfds, &xfds, &tv);
        if (timeout > 0) {
        ret = select(nfds + 1, &rfds, &wfds, &xfds, &tv);
        if (timeout > 0) {
        glib_select_poll(&rfds, &wfds, &xfds, (ret < 0));
        qemu_iohandler_poll(&rfds, &wfds, &xfds, ret);
    #ifdef CONFIG_SLIRP
        slirp_select_poll(&rfds, &wfds, &xfds, (ret < 0));
        /* Check bottom-halves last in case any of the earlier events triggered
           them.  */
        return ret;
main (vl.c) → qemu_opts_foreach (qemu-option.c) → qemu_aio_wait (aio.c) → qemu_bh_poll (async.c) → spawn_thread_bh_fn (posix-aio-compat.c) → do_spawn_thread (posix-aio-compat.c)
aio_thread (posix-aio-compat.c) → cond_timedwait (posix-aio-compat.c)
The script below can be used to observe QEMU itself.
$ vi command.gdb
set breakpoint pending on
file qemu
handle SIGUSR2 noprint nostop
break main_loop
run linux-0.2.img -vnc
$ gdb -x command.gdb


When the virtual machine reboots, the virtual CPU is reset so that it jumps to the power-on default address (the reset vector). Set a breakpoint on cpu_reset and reboot the virtual machine:
(gdb) bt
#0  cpu_reset (env=0x1251290) at /nfs_home/chenwj/work/svn/qemu-1.0/target-i386/helper.c:37
#1  0x0000000000638753 in pc_cpu_reset (opaque=0x1251290) at /nfs_home/chenwj/work/svn/qemu-1.0/hw/pc.c:928
#2  0x00000000004fe916 in qemu_system_reset (report=true) at /nfs_home/chenwj/work/svn/qemu-1.0/vl.c:1381
#3  0x00000000004feb71 in main_loop_should_exit () at /nfs_home/chenwj/work/svn/qemu-1.0/vl.c:1452
#4  0x00000000004fec48 in main_loop () at /nfs_home/chenwj/work/svn/qemu-1.0/vl.c:1485
#5  0x0000000000503864 in main (argc=4, argv=0x7fffffffe218, envp=0x7fffffffe240) at /nfs_home/chenwj/work/svn/qemu-1.0/vl.c:3485
  1. When a reboot (reset) is needed, qemu_system_reset_request (vl.c) is called to raise reset_requested.
    void qemu_system_reset_request(void)
        if (no_reboot) {
            shutdown_requested = 1;
        } else {
            reset_requested = 1;
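    • Taking i386 as an example, qemu_system_reset_request is called from roughly the following places. The first two restart the system on a triple fault; the last one is triggered through port 92.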
      1. target-i386/op_helper.c
      2. target-i386/helper.c
      3. hw/pc.c
  2. main_loop (vl.c) calls main_loop_should_exit to decide whether to leave the main loop.
    static void main_loop(void)
        bool nonblocking;
        int last_io = 0;
        do {
            nonblocking = !kvm_enabled() && last_io > 0;
            last_io = main_loop_wait(nonblocking);
        } while (!main_loop_should_exit());
  3. main_loop_should_exit
    static bool main_loop_should_exit(void)
        RunState r;
        if (qemu_debug_requested()) {
        if (qemu_shutdown_requested()) {
            monitor_protocol_event(QEVENT_SHUTDOWN, NULL);
            if (no_shutdown) {
            } else {
                return true;
        if (qemu_reset_requested()) { // Returns reset_requested.
            qemu_system_reset(VMRESET_REPORT); // Reset the system.
            if (runstate_check(RUN_STATE_INTERNAL_ERROR) ||
                runstate_check(RUN_STATE_SHUTDOWN)) {
        if (qemu_powerdown_requested()) {
            monitor_protocol_event(QEVENT_POWERDOWN, NULL);
        if (qemu_vmstop_requested(&r)) {
        return false;
  4. void qemu_system_reset(bool report)
        QEMUResetEntry *re, *nre;
        /* reset all devices */
        // Pull each device out of reset_handlers and reset it. The reset callback of each device was registered beforehand with qemu_register_reset.
        QTAILQ_FOREACH_SAFE(re, &reset_handlers, entry, nre) {
        if (report) {
            monitor_protocol_event(QEVENT_RESET, NULL);
  5. pc_cpu_reset calls cpu_reset (target-i386/helper.c).
    static void pc_cpu_reset(void *opaque)
        CPUState *env = opaque;
        env->halted = !cpu_is_bsp(env);
  6. cpu_reset (target-i386/helper.c) resets the CPU state at power-on or reboot.
    void cpu_reset(CPUX86State *env)
        int i;
        if (qemu_loglevel_mask(CPU_LOG_RESET)) {
            qemu_log("CPU Reset (CPU %d)\n", env->cpu_index);
            log_cpu_state(env, X86_DUMP_FPU | X86_DUMP_CCOP);
        memset(env, 0, offsetof(CPUX86State, breakpoints));
        tlb_flush(env, 1);
        env->old_exception = -1;
        /* init to reset state */
        env->hflags |= HF_SOFTMMU_MASK;
        env->hflags2 |= HF2_GIF_MASK;
        cpu_x86_update_cr0(env, 0x60000010);
        env->a20_mask = ~0x0;
        env->smbase = 0x30000;
        env->idt.limit = 0xffff;
        env->gdt.limit = 0xffff;
        env->ldt.limit = 0xffff;
        env->ldt.flags = DESC_P_MASK | (2 << DESC_TYPE_SHIFT);
        env->tr.limit = 0xffff;
        env->tr.flags = DESC_P_MASK | (11 << DESC_TYPE_SHIFT);
        cpu_x86_load_seg_cache(env, R_CS, 0xf000, 0xffff0000, 0xffff,
                               DESC_P_MASK | DESC_S_MASK | DESC_CS_MASK |
                               DESC_R_MASK | DESC_A_MASK);
        cpu_x86_load_seg_cache(env, R_DS, 0, 0, 0xffff,
                               DESC_P_MASK | DESC_S_MASK | DESC_W_MASK |
        cpu_x86_load_seg_cache(env, R_ES, 0, 0, 0xffff,
                               DESC_P_MASK | DESC_S_MASK | DESC_W_MASK |
        cpu_x86_load_seg_cache(env, R_SS, 0, 0, 0xffff,
                               DESC_P_MASK | DESC_S_MASK | DESC_W_MASK |
        cpu_x86_load_seg_cache(env, R_FS, 0, 0, 0xffff,
                               DESC_P_MASK | DESC_S_MASK | DESC_W_MASK |
        cpu_x86_load_seg_cache(env, R_GS, 0, 0, 0xffff,
                               DESC_P_MASK | DESC_S_MASK | DESC_W_MASK |
        env->eip = 0xfff0;
        env->regs[R_EDX] = env->cpuid_version;
        env->eflags = 0x2;
        /* FPU init */
        for(i = 0;i < 8; i++)
            env->fptags[i] = 1;
        env->fpuc = 0x37f;
        env->mxcsr = 0x1f80;
        env->pat = 0x0007040600070406ULL;
        env->msr_ia32_misc_enable = MSR_IA32_MISC_ENABLE_DEFAULT;
        memset(env->dr, 0, sizeof(env->dr));
        env->dr[6] = DR6_FIXED_1;
        env->dr[7] = DR7_FIXED_1;
        cpu_breakpoint_remove_all(env, BP_CPU);
        cpu_watchpoint_remove_all(env, BP_CPU);
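
The reset path traced in the steps above combines two small patterns: a request flag that the main loop polls, and a registry of reset callbacks that is walked when the flag is found set. The sketch below shows just those two patterns. All names are hypothetical, the fixed-size array is a simplification of the queue in the excerpt, and clearing the flag when it is consumed is a choice made by this sketch.

```c
#include <stddef.h>

#define TOY_MAX_RESET_HANDLERS 16

typedef void (*ToyResetHandler)(void *opaque);

typedef struct ToyResetEntry {
    ToyResetHandler func;
    void *opaque;
} ToyResetEntry;

static ToyResetEntry toy_reset_handlers[TOY_MAX_RESET_HANDLERS];
static size_t toy_reset_count;
static int toy_reset_requested;

/* Register a device reset callback. Returns 0 on success, -1 if full. */
static int toy_register_reset(ToyResetHandler func, void *opaque)
{
    if (toy_reset_count == TOY_MAX_RESET_HANDLERS) {
        return -1;
    }
    toy_reset_handlers[toy_reset_count].func = func;
    toy_reset_handlers[toy_reset_count].opaque = opaque;
    toy_reset_count++;
    return 0;
}

/* Called from anywhere that wants a reset; only raises the flag. */
static void toy_system_reset_request(void)
{
    toy_reset_requested = 1;
}

/* Called once per main-loop iteration. If a reset is pending, consume the
 * request and run every registered handler. Returns 1 if a reset ran. */
static int toy_main_loop_check(void)
{
    if (!toy_reset_requested) {
        return 0;
    }
    toy_reset_requested = 0;
    for (size_t i = 0; i < toy_reset_count; i++) {
        toy_reset_handlers[i].func(toy_reset_handlers[i].opaque);
    }
    return 1;
}

/* Example handler: put a toy instruction pointer back at 0xfff0, the same
 * value the excerpt above assigns to env->eip. */
static void toy_cpu_reset(void *opaque)
{
    int *eip = opaque;
    *eip = 0xfff0;
}
```

Separating the request from the reset itself means code deep inside CPU emulation never resets devices directly; the reset always runs from the main loop.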

Software MMU

  • target_phys_addr_t (targphys.h) represents the guest physical address space. For an x86 guest with PAE enabled, target_phys_addr_t is 64 bits.
  • target_ulong represents the guest register size and virtual address space. For an x86 guest with PAE enabled, target_ulong is 32 bits.
  • ram_addr_t (cpu-common.h) represents the host virtual address space. On an x86 host, ram_addr_t is 32 bits.
  • tb_page_addr_t (exec-all.h) is typedef'd to ram_addr_t in system mode; in process mode it is typedef'd to abi_ulong, which in turn is typedef'd to target_ulong.
guest virtual addr (GVA) → guest physical addr (GPA) → host virtual addr (HVA)
  1. GVA → GPA is handled by the guest operating system; GPA → HVA is handled by QEMU; HVA → HPA is handled by the host operating system.
      • GVA → HVA. The TLB stores the offset of the HVA relative to the GVA. When translating a GVA to an HVA, the TLB is searched first. On a hit, the offset is added to the GVA to obtain the HVA. On a miss, l1_phys_map is searched and the PhysPageDesc is filled into the TLB.
    1. guest virtual addr → guest physical addr
    2. Search l1_phys_map to obtain the PhysPageDesc.
    3. Add PhysPageDesc.phys_offset to phys_ram_base to obtain the host virtual addr (guest physical addr → host virtual addr).
  typedef struct CPUTLBEntry {
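    // The fields below hold the GVA and also encode the page's permissions. tlb_set_page (exec.c) sets them when it fills in a new TLB entry.
    target_ulong addr_read; // readable
    target_ulong addr_write; // writable
    target_ulong addr_code; // executable
    // Offset of the HVA relative to the GVA.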
    unsigned long addend;
} CPUTLBEntry;
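
The role of addend is easiest to see in a stripped-down software TLB: a direct-mapped table indexed by the guest page number, where a hit turns a guest virtual address into a host pointer with a single addition. The sketch below is hypothetical throughout (StlbEntry, stlb_set_page and stlb_lookup are made-up names) and assumes 32-bit guest addresses, 4 KiB pages, and a single permission field instead of the three in CPUTLBEntry.

```c
#include <stddef.h>
#include <stdint.h>

#define STLB_PAGE_BITS 12
#define STLB_PAGE_MASK (~(((uint32_t)1 << STLB_PAGE_BITS) - 1u))
#define STLB_SIZE      256
#define STLB_INVALID   UINT32_MAX   /* not page aligned, so it never matches */

typedef struct StlbEntry {
    uint32_t addr_read;   /* page-aligned GVA this entry maps, or STLB_INVALID */
    uintptr_t addend;     /* host address minus guest virtual address */
} StlbEntry;

static StlbEntry stlb[STLB_SIZE];

/* Invalidate every entry. Must be called once before the first lookup. */
static void stlb_flush(void)
{
    for (size_t i = 0; i < STLB_SIZE; i++) {
        stlb[i].addr_read = STLB_INVALID;
    }
}

/* Install a mapping from the guest page containing gva to host_page. */
static void stlb_set_page(uint32_t gva, uint8_t *host_page)
{
    uint32_t page = gva & STLB_PAGE_MASK;
    StlbEntry *e = &stlb[(gva >> STLB_PAGE_BITS) & (STLB_SIZE - 1)];

    e->addr_read = page;
    e->addend = (uintptr_t)host_page - page;
}

/* Translate gva to a host pointer. Returns NULL on a TLB miss; a full
 * implementation would walk the page tables and refill the entry. */
static uint8_t *stlb_lookup(uint32_t gva)
{
    StlbEntry *e = &stlb[(gva >> STLB_PAGE_BITS) & (STLB_SIZE - 1)];

    if (e->addr_read != (gva & STLB_PAGE_MASK)) {
        return NULL;
    }
    return (uint8_t *)((uintptr_t)gva + e->addend);
}
```

Storing the difference rather than the host page address keeps the hit path to one compare and one add, because the page offset is already part of gva.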
  1. tb_find_slow looks up the TB for a guest pc (GVA) using the guest physical address (GPA) that the pc maps to.
    static TranslationBlock *tb_find_slow(target_ulong pc, ...)
        /* find translated block using physical mappings */
        phys_pc = get_page_addr_code(env, pc);
        phys_page1 = phys_pc & TARGET_PAGE_MASK;
        phys_page2 = -1;
        // Look up tb_phys_hash using phys_pc, the physical address that the virtual address pc maps to.
         h = tb_phys_hash_func(phys_pc);
        ptb1 = &tb_phys_hash[h];
        for(;;) {
            tb = *ptb1;
            if (!tb)
                goto not_found;
            if (tb->pc == pc &&
                tb->page_addr[0] == phys_page1 && // Is the physical page this TB belongs to (guest code) the same as the one pc belongs to?
                tb->cs_base == cs_base &&
                tb->flags == flags) {
                /* check next page if needed */
                if (tb->page_addr[1] != -1) { // This TB spans two physical pages.
                    virt_page2 = (pc & TARGET_PAGE_MASK) +
                    phys_page2 = get_page_addr_code(env, virt_page2);
                    if (tb->page_addr[1] == phys_page2) // Is the TB's second physical page the same as the second physical page for pc?
                        goto found;
                } else {
                    goto found;
            ptb1 = &tb->phys_hash_next; // Follow the chain when several phys_pc values hash to the same tb_phys_hash entry.
       /* if no translated code available, then translate it now */
        tb = tb_gen_code(env, pc, cs_base, flags, 0);
        /* we add the TB in the virtual pc hash table */
        env->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
        return tb;
  2. get_page_addr_code (exec.c) 先查找 TLB。process mode 情況有所不同,此時沒有所謂的 GPA,直接返回 addr。注意! get_page_addr_code 是被 tb_find_slow (cpu-exec.c) 或是 tb_gen_code (exec.c) 這兩個函式呼叫,get_page_addr_code 中的 code 代表存取的地址是一段 code。因此,皆是呼叫到 ld*_code 或是 ldb_cmmu。強烈建議查看 i386-softmmu/exec.i。
    static inline tb_page_addr_t get_page_addr_code(CPUState *env1, target_ulong addr)
        page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); // 計算 GVA 對映的 TLB 索引
         mmu_idx = cpu_mmu_index(env1);
        // TLB 不命中
        if (unlikely(env1->tlb_table[mmu_idx][page_index].addr_code !=
                     (addr & TARGET_PAGE_MASK))) {
        // TLB 命中。檢查欲執行的位址屬於 RAM 之後,計算 GVA 對映的 HVA。
        p = (void *)(unsigned long)addr
            + env1->tlb_table[mmu_idx][page_index].addend;
        // Return the offset of the HVA within RAM.
        return qemu_ram_addr_from_host(p);
  3. TLB miss. ldub_code (softmmu_header.h) is a function generated by macro expansion.
    static inline RES_TYPE glue(glue(ld, USUFFIX), MEMSUFFIX)(target_ulong ptr)
        if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
                     (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
            // ADDR_READ is replaced by addr_code or addr_read depending on the case. Code is being
            // accessed here, so ADDR_READ is replaced by addr_code.
            res = glue(glue(__ld, SUFFIX), MMUSUFFIX)(addr, mmu_idx);
        } else {
            physaddr = addr + env->tlb_table[mmu_idx][page_index].addend;
            res = glue(glue(ld, USUFFIX), _raw)((uint8_t *)physaddr);
        return res;
  4. ldb_cmmu (softmmu_template.h). The cmmu suffix means that code is being accessed; mmu means that data is being accessed.
    /* handle all cases except unaligned access which span two pages */
    DATA_TYPE REGPARM glue(glue(__ld, SUFFIX), MMUSUFFIX)(target_ulong addr,
                                                          int mmu_idx)
        // Look up the TLB first
        // ADDR_READ is replaced by addr_code.
        tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
        if ((addr & TARGET_PAGE_MASK) == (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
            // TLB hit
            if (tlb_addr & ~TARGET_PAGE_MASK) {
                /* IO access */
                // iotlb caches the IO emulation functions
                ioaddr = env->iotlb[mmu_idx][index];
            } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
                /* slow unaligned access (it spans two pages) */
                // slow_ldb_cmmu is called here to perform the page-crossing access.
            } else {
                /* unaligned/aligned access in the same page */
                addend = env->tlb_table[mmu_idx][index].addend;
                res = glue(glue(ld, USUFFIX), _raw)((uint8_t *)(long)(addr+addend));
        } else {
            /* the page is not in the TLB : fill it */
            // GETPC wraps __builtin_return_address; see http://gcc.gnu.org/onlinedocs/gcc/Return-Address.html.
            // It obtains this function's return address, which tells us which caller invoked it.
            // At the end of exec.c, GETPC has been defined as NULL.
            retaddr = GETPC();
            // Each ISA defines its own tlb_fill
            tlb_fill(addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
            goto redo;
        return res;
    • The end of exec.c defines the following macros and includes softmmu_template.h to expand the macros in it. See exec.i.
      #define MMUSUFFIX _cmmu // load code
      #define GETPC() NULL // tb_find_slow -> get_page_addr_code -> ldub_code -> __ldb_cmmu
      #define env cpu_single_env
      #define SHIFT 0
      #include "softmmu_template.h"
      #define SHIFT 1
      #include "softmmu_template.h"
      #define SHIFT 2
      #include "softmmu_template.h"
      #define SHIFT 3
      #include "softmmu_template.h"
      #undef env
  5. tlb_fill (target-i386/op_helper.c). Walks the page table. If the page is present, it is brought into the TLB; if it is not, a page fault is raised.
    /* try to fill the TLB and return an exception if error. If retaddr is
       NULL, it means that the function was called in C code (i.e. not
       from generated code or from helper.c) */
    void tlb_fill(target_ulong addr, int is_write, int mmu_idx, void *retaddr)
        ret = cpu_x86_handle_mmu_fault(env, addr, is_write, mmu_idx, 1);
        if (ret) { // Something went wrong!
            if (retaddr) { // tlb_fill was called from the code cache or from helper.c.
                /* now we have a real cpu fault */
                pc = (unsigned long)retaddr;
                tb = tb_find_pc(pc);
                if (tb) {
                    /* the PC is inside the translated code. It means that we have
                       a virtual CPU fault */
                    cpu_restore_state(tb, env, pc, NULL);
            // tlb_fill was called the ordinary way (via get_page_addr_code), not from the code cache or a helper function.
            // Raise a guest page fault exception; the guest OS starts its page fault handling.
            raise_exception_err(env->exception_index, env->error_code);
        // The page is already in memory.
        env = saved_env;
    • A non-NULL retaddr means that tlb_fill was not called from get_page_addr_code but, for example, from target-i386/op_helper.c or from the code cache. See op_helper.i.
      uint8_t __ldb_mmu(target_ulong addr, int mmu_idx)
              retaddr = ((void *)((unsigned long)__builtin_return_address(0) - 1));
              tlb_fill(addr, 0, mmu_idx, retaddr);
              goto redo;
  6. cpu_x86_handle_mmu_fault (target-i386/helper.c) walks the page table. If the page is already in memory, it calls tlb_set_page to write the page into the TLB.
    /* return value:
       -1 = cannot handle fault
       0  = nothing more to do // The page is already in memory; filling in the proper TLB entry is enough.
       1  = generate PF fault  // The page is not in memory; raise a page fault.
    int cpu_x86_handle_mmu_fault(CPUX86State *env, target_ulong addr, ...)
        tlb_set_page(env, vaddr, paddr, prot, mmu_idx, page_size);
        return 0;
        return 1;
  7. tlb_set_page (exec.c) fills in a TLB entry.
    void tlb_set_page(CPUState *env, target_ulong vaddr, ...)
        CPUTLBEntry *te;
        // Returns the host virtual address
        addend = (unsigned long)qemu_get_ram_ptr(pd & TARGET_PAGE_MASK);
        // Update the TLB entry
        te = &env->tlb_table[mmu_idx][index];
        te->addend = addend - vaddr; // Offset between the host virtual address and the guest virtual address.
  • softmmu-semi.h
  • softmmu_defs.h: declares the __{ld,st}* function prototypes.
  • softmmu_exec.h: uses softmmu_header.h to generate the {ld,st}_{user,kernel,etc} functions, which in turn call the __{ld,st}* functions.
  • softmmu_header.h: generates the {ld,st}* functions, which in turn call the __{ld,st}* functions.
  • softmmu_template.h: generates the __{ld,st}* functions.
  • softmmu_defs.h: declares the __{ld,st}* function prototypes used by the TCG IR ops qemu_ld/qemu_st. It is #included by softmmu_exec.h, tcg/xxx/tcg-target.c and exec-all.h.
  • softmmu_template.h: exec.c and target-*/op_helper.c use softmmu_template.h to generate the __{ld,st}* functions. When tcg_out_qemu_ld (tcg/xxx/tcg-target.c) emits host binary code for qemu_ld/qemu_st, the emitted code calls the __{ld,st}* functions generated from softmmu_template.h.
    1. tcg_out_qemu_ld (tcg/i386/tcg-target.c)。
      #include "../../softmmu_defs.h"
      // softmmu_defs.h
      // uint8_t REGPARM __ldb_mmu(target_ulong addr, int mmu_idx);
      // void REGPARM __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
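      // Memory load instruction. The virtual address named by the instruction is translated to a physical address through the software MMU.
      // softmmu_template.h defines the corresponding functions through macro expansion.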
      static void *qemu_ld_helpers[4] = {
          __ldb_mmu, // load byte
          __ldw_mmu, // load word 
          __ldl_mmu, // load long word
          __ldq_mmu, // load quad word
      /* XXX: qemu_ld and qemu_st could be modified to clobber only EDX and
         EAX. It will be useful once fixed registers globals are less
         common. */
      static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
                                  int opc)
          /* ... omitted ... */
          tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]); // Call one of the functions above.
          /* ... omitted ... */
    2. switch_tss (target-i386/op_helper.c)
      // #define MMU_MODE0_SUFFIX _kernel
      // #define MMU_MODE1_SUFFIX _user
      #include "cpu.h"
      // softmmu_exec.h generates the {ld,st}*_{kernel,user,etc} functions.
      // #define ACCESS_TYPE 0
      // #define DATA_SIZE 1
      // #include "softmmu_header.h"
      #if !defined(CONFIG_USER_ONLY)
      #include "softmmu_exec.h"
      #endif /* !defined(CONFIG_USER_ONLY) */
      #define MMUSUFFIX _mmu
      #define SHIFT 0
      #include "softmmu_template.h"
      // softmmu_template.h
      // SUFFIX is the data size: b (byte, 8), w (word, 16), l (long word, 32) or q (quad word, 64)
      // MMUSUFFIX says whether code or data is accessed: _cmmu or _mmu.
      DATA_TYPE REGPARM glue(glue(__ld, SUFFIX), MMUSUFFIX)(target_ulong addr,
                                                            int mmu_idx)
      // target-i386/op_helper.i
      uint8_t __ldb_mmu(target_ulong addr, int mmu_idx)
      # 1 "/tmp/chenwj/qemu/softmmu_exec.h" 1
      # 27 "/tmp/chenwj/qemu/softmmu_exec.h"
      # 1 "/tmp/chenwj/qemu/softmmu_header.h" 1
      # 83 "/tmp/chenwj/qemu/softmmu_header.h"
      // Access kernel-mode data
      static __attribute__ (( always_inline )) __inline__ uint32_t ldub_kernel(target_ulong ptr)
          if (__builtin_expect(!!(env->tlb_table[mmu_idx][page_index].addr_read != (addr & (~((1 << 12) - 1) | (1 - 1)))), 0)
                                                                     ) {
              res = __ldb_mmu(addr, mmu_idx); // softmmu_defs.h declares the prototype; the body is provided by softmmu_template.h.
          } else {
              physaddr = addr + env->tlb_table[mmu_idx][page_index].addend;
              // softmmu_exec.h declares the prototype; the body is provided by softmmu_header.h.
              res = ldub_p((uint8_t *)(long)(((uint8_t *)physaddr)));
      static void switch_tss(int tss_selector, ...)
          /* ... omitted ... */
          v1 = ldub_kernel(env->tr.base);
          v2 = ldub_kernel(env->tr.base + old_tss_limit_max);
          /* ... omitted ... */
  • softmmu_exec.h: target-*/op_helper.c #includes softmmu_exec.h, which then uses softmmu_header.h to generate the {ld,st}*_{kernel,user,etc} functions.
    • softmmu_exec.h #includes softmmu_defs.h, which declares the __{ld,st} function prototypes; their bodies are provided by softmmu_template.h. softmmu_exec.h also #includes softmmu_header.h, whose macros generate the {ld,st}*_{kernel,user,etc} functions.
      // softmmu_defs.h
      uint8_t REGPARM __ldb_mmu(target_ulong addr, int mmu_idx);
      void REGPARM __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
      // softmmu_template.h
      DATA_TYPE REGPARM glue(glue(__ld, SUFFIX), MMUSUFFIX)(target_ulong addr,
                                                            int mmu_idx)
      // softmmu_header.h
      static inline RES_TYPE glue(glue(ld, USUFFIX), MEMSUFFIX)(target_ulong ptr)
              /* ... omitted ... */
              res = glue(glue(__ld, SUFFIX), MMUSUFFIX)(addr, mmu_idx);
              /* ... omitted ... */
    • helper_fldt (target-i386/op_helper.c)
      #if !defined(CONFIG_USER_ONLY)
      #include "softmmu_exec.h"
      #endif /* !defined(CONFIG_USER_ONLY) */
      static inline floatx80 helper_fldt(target_ulong ptr)
          CPU_LDoubleU temp;
          temp.l.lower = ldq(ptr);     // #define ldub(p) ldub_data(p) in softmmu_exec.h
          temp.l.upper = lduw(ptr + 8);
          return temp.d;
  • softmmu_header.h: defines macros. It is #included by softmmu_exec.h and exec-all.h, which expand the macros to define inline ld/st functions according to MMU mode and data size.
    • softmmu_exec.h: #included by target-xxx/op_helper.c. Generates inline ld/st functions according to MMU mode (user/kernel) and data size.
      // target-xxx/cpu.h defines its own MMU_MODE?_SUFFIX.
      // For i386: _kernel, _user.
      // target-i386/op_helper.c
      #include "cpu.h"
      #include "softmmu_exec.h"
      // softmmu_exec.h
      #include "softmmu_defs.h"
      #define ACCESS_TYPE 0
      #define DATA_SIZE 1
      #include "softmmu_header.h"
      // softmmu_header.h
      // op_helper.i -> ldub_kernel
      // ld/st eventually call __ld/__st
      // kernel mode: env->tlb_table[0]
      // user mode: env->tlb_table[1]
      // data: env->tlb_table[(cpu_mmu_index(env))]
      static inline RES_TYPE glue(glue(ld, USUFFIX), MEMSUFFIX)(target_ulong ptr)
  1. exec-all.h. Defines the softmmu functions used by the code cache.
    #include "softmmu_defs.h"
    // uint64_t REGPARM __ldq_mmu(target_ulong addr, int mmu_idx);
    // void REGPARM __stq_mmu(target_ulong addr, uint64_t val, int mmu_idx);
    // uint8_t REGPARM __ldb_cmmu(target_ulong addr, int mmu_idx);
    // void REGPARM __stb_cmmu(target_ulong addr, uint8_t val, int mmu_idx);
    #define ACCESS_TYPE (NB_MMU_MODES + 1)
    #define MEMSUFFIX _code
    #define env cpu_single_env
    #define DATA_SIZE 1
    #include "softmmu_header.h"
    // softmmu_header.h
    #elif ACCESS_TYPE == (NB_MMU_MODES + 1)
    #define CPU_MMU_INDEX (cpu_mmu_index(env))
    #define MMUSUFFIX _cmmu
    // Generates ldub_code -> __ldb_cmmu (the name comes from MEMSUFFIX _code)
    // env->tlb_table[(cpu_mmu_index(env))]
  • cpu-exec.i: {ld, st}{sb, ub}_{kernel, user, data, p}. The p variants access memory directly.
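The glue-based generation described above can be illustrated with a minimal sketch. It assumes the usual two-level token-pasting definition of glue; the names demo_mem, DEMO_DEFINE_LOAD and demo_ld*_mmu are hypothetical and are not part of QEMU. One "template" macro is expanded once per suffix, much as softmmu_template.h is included once per SHIFT value.

```c
#include <assert.h>
#include <stdint.h>

/* Two-level paste: arguments are macro-expanded before ## is applied. */
#define xglue(a, b) a##b
#define glue(a, b) xglue(a, b)

/* Hypothetical stand-in for guest memory. */
static const uint8_t demo_mem[8] = {0x11, 0x22, 0x33, 0x44,
                                    0x55, 0x66, 0x77, 0x88};

/* "Template" expanded once per (SUFFIX, DATA_TYPE) pair. */
#define DEMO_DEFINE_LOAD(SUFFIX, DATA_TYPE)                              \
    static DATA_TYPE glue(glue(demo_ld, SUFFIX), _mmu)(unsigned addr)    \
    {                                                                    \
        DATA_TYPE v = 0;                                                 \
        unsigned i;                                                      \
        assert(addr + sizeof(DATA_TYPE) <= sizeof(demo_mem));            \
        for (i = 0; i < sizeof(DATA_TYPE); i++) {                        \
            /* assemble the value in little-endian byte order */         \
            v |= (DATA_TYPE)((DATA_TYPE)demo_mem[addr + i] << (8 * i));  \
        }                                                                \
        return v;                                                        \
    }

DEMO_DEFINE_LOAD(b, uint8_t)   /* defines demo_ldb_mmu */
DEMO_DEFINE_LOAD(w, uint16_t)  /* defines demo_ldw_mmu */
DEMO_DEFINE_LOAD(l, uint32_t)  /* defines demo_ldl_mmu */
```

Calling demo_ldw_mmu(0) reads two bytes and combines them little-endian. The point of the sketch is only the naming mechanism: the function name is assembled from a fixed prefix, a size suffix and an access-kind suffix.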
There are several places where the QEMU softmmu can be accelerated 2)3)4)
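The lookup path walked through in this section (index by page number, compare the tag, add the addend) can be condensed into a small sketch. This is a simplified model under stated assumptions, not QEMU's implementation: there is a single MMU mode, only a code tag, and every Demo*/demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_BITS 12
#define DEMO_PAGE_MASK (~(((uintptr_t)1 << DEMO_PAGE_BITS) - 1))
#define DEMO_TLB_SIZE  256

typedef struct {
    uintptr_t addr_code; /* page-aligned guest virtual address, or -1 if invalid */
    uintptr_t addend;    /* host address minus guest virtual address */
} DemoTLBEntry;

static DemoTLBEntry demo_tlb[DEMO_TLB_SIZE];

/* Direct-mapped index: low bits of the guest page number. */
static size_t demo_tlb_index(uintptr_t gva)
{
    return (gva >> DEMO_PAGE_BITS) & (DEMO_TLB_SIZE - 1);
}

/* Invalidate every entry; -1 can never equal a page-aligned tag. */
static void demo_tlb_flush(void)
{
    size_t i;
    for (i = 0; i < DEMO_TLB_SIZE; i++) {
        demo_tlb[i].addr_code = (uintptr_t)-1;
        demo_tlb[i].addend = 0;
    }
}

/* Fill the entry for the page containing gva (the role tlb_set_page plays). */
static void demo_tlb_set_page(uintptr_t gva, void *host_page)
{
    DemoTLBEntry *te = &demo_tlb[demo_tlb_index(gva)];
    assert(host_page != NULL);
    te->addr_code = gva & DEMO_PAGE_MASK;
    te->addend = (uintptr_t)host_page - (gva & DEMO_PAGE_MASK);
}

/* Return the host pointer on a hit, NULL on a miss. */
static void *demo_tlb_lookup(uintptr_t gva)
{
    const DemoTLBEntry *te = &demo_tlb[demo_tlb_index(gva)];
    if (te->addr_code != (gva & DEMO_PAGE_MASK)) {
        return NULL; /* miss: the caller would walk the page table and refill */
    }
    return (void *)(gva + te->addend);
}
```

Because the table is direct-mapped, two guest pages whose page numbers share the same low bits evict each other. Storing guest-to-host as a single addend keeps the hit path to one compare and one add.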

System Call


Taking x86 as an example, tlb_flush is called in several situations.
  1. cpu_x86_update_crN. In target-i386/translate.c, a mov reg, crN or mov crN, reg calls helper_write_crN (target-i386/op_helper.c), which in turn calls cpu_x86_update_crN (target-i386/helper.c) as needed. See Control register.
  2. cpu_register_physical_memory_log
  3. cpu_reset
  4. cpu_x86_set_a20
  1. tlb_flush (exec.c).
    void tlb_flush(CPUState *env, int flush_global)
        int i;
        /* must reset current TB so that interrupts cannot modify the
           links while we are modifying them */
        env->current_tb = NULL;
        for(i = 0; i < CPU_TLB_SIZE; i++) {
            int mmu_idx;
            for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
                env->tlb_table[mmu_idx][i] = s_cputlb_empty_entry;
        // The softmmu (TLB) is now invalid and the GVA -> HVA mappings no longer hold, so tb_jmp_cache, which is indexed by GVA (guest pc), must be cleared.
        memset (env->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof (void *));
        env->tlb_flush_addr = -1;
        env->tlb_flush_mask = 0;
  2. QEMU is then forced into tb_find_slow, which looks up an already translated TranslationBlock by GPA and performs extra checks. Note that retranslation may happen at this point!
    static TranslationBlock *tb_find_slow(CPUState *env, ...)
        for(;;) {
            tb = *ptb1;
            if (!tb)
                goto not_found;
            if (tb->pc == pc &&
                tb->page_addr[0] == phys_page1 &&
                tb->cs_base == cs_base &&
                tb->flags == flags) {
                /* check next page if needed */
                if (tb->page_addr[1] != -1) {
                    tb_page_addr_t phys_page2;
                    virt_page2 = (pc & TARGET_PAGE_MASK) +
                        TARGET_PAGE_SIZE;
                    phys_page2 = get_page_addr_code(env, virt_page2);
                    if (tb->page_addr[1] == phys_page2)
                        goto found;
                } else {
                    goto found;
            ptb1 = &tb->phys_hash_next;
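The interaction between tlb_flush and the two lookup levels can be sketched as follows. This is a reduced model under stated assumptions: the guest virtual to guest physical mapping is a single hypothetical offset (demo_map_offset), cs_base, flags and page-crossing checks are left out, and all Demo*/demo_* names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CACHE_SIZE 64

typedef struct DemoTB {
    uint32_t pc;                   /* guest virtual pc */
    uint32_t phys_pc;              /* guest physical pc */
    struct DemoTB *phys_hash_next; /* collision chain */
} DemoTB;

static DemoTB *demo_jmp_cache[DEMO_CACHE_SIZE]; /* indexed by virtual pc */
static DemoTB *demo_phys_hash[DEMO_CACHE_SIZE]; /* indexed by physical pc */
static uint32_t demo_map_offset = 0x100000;     /* hypothetical GVA -> GPA mapping */
static int demo_slow_lookups;                   /* counts slow-path entries */

static unsigned demo_hash(uint32_t addr)
{
    return (addr >> 2) & (DEMO_CACHE_SIZE - 1);
}

/* Register a "translated" block under its physical address. */
static void demo_tb_insert(DemoTB *tb, uint32_t pc)
{
    DemoTB **head;
    assert(tb != NULL);
    tb->pc = pc;
    tb->phys_pc = pc + demo_map_offset;
    head = &demo_phys_hash[demo_hash(tb->phys_pc)];
    tb->phys_hash_next = *head;
    *head = tb;
}

/* Slow path: translate pc to a physical address, then walk the chain. */
static DemoTB *demo_tb_find_slow(uint32_t pc)
{
    uint32_t phys_pc = pc + demo_map_offset;
    DemoTB *tb;
    demo_slow_lookups++;
    for (tb = demo_phys_hash[demo_hash(phys_pc)]; tb; tb = tb->phys_hash_next) {
        if (tb->pc == pc && tb->phys_pc == phys_pc) {
            demo_jmp_cache[demo_hash(pc)] = tb; /* refill the fast cache */
            return tb;
        }
    }
    return NULL; /* the real code would translate a new block here */
}

/* Fast path: one probe of the virtual-pc cache. */
static DemoTB *demo_tb_find_fast(uint32_t pc)
{
    DemoTB *tb = demo_jmp_cache[demo_hash(pc)];
    if (tb && tb->pc == pc) {
        return tb;
    }
    return demo_tb_find_slow(pc);
}

/* A flush only drops the virtual-pc cache; translated blocks survive. */
static void demo_tlb_flush(void)
{
    memset(demo_jmp_cache, 0, sizeof(demo_jmp_cache));
}
```

In this model a flush costs one extra slow lookup per block as long as the mapping is unchanged. A block is only missed, and would be retranslated, when the same virtual pc now maps to a different physical address.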

Interrupt & Exception Handling

Taking x86 as an example,
  • cpu-defs.h defines the exception numbers.
    #define EXCP_INTERRUPT  0x10000 /* async interruption */
    #define EXCP_HLT        0x10001 /* hlt instruction reached */
    #define EXCP_DEBUG      0x10002 /* cpu stopped after a breakpoint or singlestep */
    #define EXCP_HALTED     0x10003 /* cpu is halted (waiting for external event) */
  • Some functions are prefixed with QEMU_NORETURN (compiler.h), which means that the function does not return.
    #define QEMU_NORETURN __attribute__ ((__noreturn__))
  1. cpu_exec (cpu-exec.c)
    int cpu_exec(CPUState *env)
        if (env->halted) { // env->halted is only raised in system mode.
            if (!cpu_has_work(env)) {
                return EXCP_HALTED;
            env->halted = 0;
        cpu_single_env = env; // Save the current env. After a longjmp, env can be restored from cpu_single_env.
        if (unlikely(exit_request)) {
            env->exit_request = 1;
    // Each architecture has its own pre-processing.
    #if defined(TARGET_I386)
        CC_SRC = env->eflags & (CC_O | CC_S | CC_Z | CC_A | CC_P | CC_C);
        DF = 1 - (2 * ((env->eflags >> 10) & 1));
        CC_OP = CC_OP_EFLAGS;
        env->eflags &= ~(DF_MASK | CC_O | CC_S | CC_Z | CC_A | CC_P | CC_C);
    #error unsupported target CPU
        // The return value of cpu_exec is env->exception_index. In process mode, for example, cpu_loop inspects the return value after calling cpu_exec and handles it accordingly.
        env->exception_index = -1;
        // The translate-and-execute loop.
        /* prepare setjmp context for exception handling */
        for(;;) {
            if (setjmp(env->jmp_env) == 0) { // Normal flow.
                next_tb = 0; /* force lookup of first TB */
                for(;;) {
                } /* for(;;) */
            } else {
                /* Reload env after longjmp - the compiler may have smashed all
                 * local variables as longjmp is marked 'noreturn'. */
                env = cpu_single_env;
        } /* for(;;) */
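    • QEMU supports precise exceptions. When an exception occurs, the execution flow restores env from the saved copy (cpu_single_env).
  2. Outer loop. cpu_exec uses setjmp/longjmp to handle exceptions. cpu_loop_exit (cpu-exec.c) and cpu_resume_from_signal (cpu-exec.c) call longjmp to return to the exception handling branch set up by setjmp. 5)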
        for(;;) {
            if (setjmp(env->jmp_env) == 0) {
                /* if an exception is pending, we execute it here */
                if (env->exception_index >= 0) { // An exception_index other than -1 means something needs handling.
                    if (env->exception_index >= EXCP_INTERRUPT) { // Exception originating outside cpu_exec.
                        /* exit request from the cpu execution loop */
                        ret = env->exception_index;
                        if (ret == EXCP_DEBUG) {
                    } else {
    #if defined(CONFIG_USER_ONLY) 
                        /* if user mode only, we simulate a fake exception
                           which will be handled outside the cpu execution
                           loop */
    #if defined(TARGET_I386)
                        ret = env->exception_index;
            } else {
                /* Reload env after longjmp - the compiler may have smashed all
                 * local variables as longjmp is marked 'noreturn'. */
                env = cpu_single_env;
        } /* for(;;) */
        /* Restore the fields in env, clear cpu_single_env, and return to cpu_loop to handle the exception */
    • cpu_loop_exit (cpu-exec.c)
      void cpu_loop_exit(CPUState *env)
          env->current_tb = NULL;
          longjmp(env->jmp_env, 1);
    • cpu_resume_from_signal (cpu-exec.c)
      /* exit the current TB from a signal handler. The host registers are
         restored in a state compatible with the CPU emulator
      #if defined(CONFIG_SOFTMMU)
      void cpu_resume_from_signal(CPUState *env, void *puc)
          /* XXX: restore cpu registers saved in host registers */
          env->exception_index = -1;
          longjmp(env->jmp_env, 1);
  3. user mode and system mode are handled differently.
    void do_interrupt(CPUState *env1)
        CPUState *saved_env;
        saved_env = env;
        env = env1;
    #if defined(CONFIG_USER_ONLY)
        /* if user mode only, we simulate a fake exception
           which will be handled outside the cpu execution
           loop */
        /* successfully delivered */
        env->old_exception = -1;
        /* simulate a real cpu exception. On i386, it can
           trigger new exceptions, but we do not handle
           double or triple faults yet. */
                         env->exception_next_eip, 0);
        /* successfully delivered */
        env->old_exception = -1;
        env = saved_env;
    • user mode.
      • General protection fault
        #if defined(CONFIG_USER_ONLY)
        /* fake user mode interrupt */
        static void do_interrupt_user(int intno, int is_int, int error_code,
                                      target_ulong next_eip)
            SegmentCache *dt;
            target_ulong ptr;
            int dpl, cpl, shift;
            uint32_t e2;
            dt = &env->idt; // Fetch the interrupt descriptor table
            if (env->hflags & HF_LMA_MASK) {
                shift = 4;
            } else {
                shift = 3;
            ptr = dt->base + (intno << shift); // Locate the table entry for the interrupt handler
            e2 = ldl_kernel(ptr + 4);
            dpl = (e2 >> DESC_DPL_SHIFT) & 3;
            cpl = env->hflags & HF_CPL_MASK;
            /* check privilege if software int */
            if (is_int && dpl < cpl)
                raise_exception_err(EXCP0D_GPF, (intno << shift) + 2); // target-i386/cpu.h
            /* Since we emulate only user space, we cannot do more than
               exiting the emulation with the suitable exception and error
               code */
            if (is_int)
                EIP = next_eip; // target-i386/cpu.h:#define EIP (env->eip)
      • raise_exception_err (target-i386/op_helper.c) forwards to raise_interrupt (target-i386/op_helper.c).
        static void QEMU_NORETURN raise_exception_err(int exception_index,
                                                      int error_code)
            raise_interrupt(exception_index, 0, error_code, 0);
      • raise_interrupt (target-i386/op_helper.c) finally calls cpu_loop_exit (cpu-exec.c), which longjmps back to the else branch of the outer loop.
         * Signal an interruption. It is executed in the main CPU loop.
         * is_int is TRUE if coming from the int instruction. next_eip is the
         * EIP value AFTER the interrupt instruction. It is only relevant if
         * is_int is TRUE.
        static void QEMU_NORETURN raise_interrupt(int intno, int is_int, int error_code,
                                                  int next_eip_addend)
            if (!is_int) {
                helper_svm_check_intercept_param(SVM_EXIT_EXCP_BASE + intno, error_code);
                intno = check_exception(intno, &error_code);
            } else {
                helper_svm_check_intercept_param(SVM_EXIT_SWINT, 0);
            env->exception_index = intno;
            env->error_code = error_code;
            env->exception_is_int = is_int;
            env->exception_next_eip = env->eip + next_eip_addend;
    • system mode.
      • do_interrupt_all (target-i386/op_helper.c).
        static void do_interrupt_all(int intno, int is_int, int error_code,
                                     target_ulong next_eip, int is_hw)
            ... omitted ...
            // Check which mode the CPU is in and hand over to the matching function. For a 64-bit guest there is also do_interrupt_64.
            if (env->cr[0] & CR0_PE_MASK) {
                if (env->hflags & HF_SVMI_MASK)
                    handle_even_inj(intno, is_int, error_code, is_hw, 0);
                    do_interrupt_protected(intno, is_int, error_code, next_eip, is_hw);
            } else {
                do_interrupt_real(intno, is_int, error_code, next_eip);
            ... omitted ...
      • Kernel mode and user mode have a kernel stack and a user stack respectively.
      • do_interrupt_real
        static void do_interrupt_real(int intno, int is_int, int error_code,
                                      unsigned int next_eip)
            ... omitted ...
            // Get the top of the kernel stack.
            ssp = env->segs[R_SS].base;
            esp = ESP; // #define ESP (env->regs[R_ESP]) (target-i386/cpu.h)
            ssp = env->segs[R_SS].base;
            // Push the user-mode state onto the kernel stack.
            /* XXX: use SS segment size ? */
            PUSHW(ssp, esp, 0xffff, compute_eflags());
            PUSHW(ssp, esp, 0xffff, old_cs);
            PUSHW(ssp, esp, 0xffff, old_eip);
            // Point env->eip at the interrupt vector. After returning to cpu_exec, the interrupt vector is translated and executed.
            /* update processor state */
            ESP = (ESP & ~0xffff) | (esp & 0xffff);
            env->eip = offset;
            env->segs[R_CS].selector = selector;
            env->segs[R_CS].base = (selector << 4);
            env->eflags &= ~(IF_MASK | TF_MASK | AC_MASK | RF_MASK);
      • helper_iret_real. When the interrupt vector finishes executing, it executes iret to return to user mode.
        void helper_iret_real(int shift)
            ... omitted ...
            // Pop the user-mode state from the kernel stack.
            if (shift == 1) {
                /* 32 bits */
                POPL(ssp, sp, sp_mask, new_eip);
                POPL(ssp, sp, sp_mask, new_cs);
                new_cs &= 0xffff;
                POPL(ssp, sp, sp_mask, new_eflags);
            } else {
                /* 16 bits */
                POPW(ssp, sp, sp_mask, new_eip);
                POPW(ssp, sp, sp_mask, new_cs);
                POPW(ssp, sp, sp_mask, new_eflags);
            ESP = (ESP & ~sp_mask) | (sp & sp_mask);
            env->segs[R_CS].selector = new_cs;
            env->segs[R_CS].base = (new_cs << 4);
            env->eip = new_eip; // The pc to execute after returning to user mode.
            if (env->eflags & VM_MASK)
                eflags_mask = TF_MASK | AC_MASK | ID_MASK | IF_MASK | RF_MASK | NT_MASK;
                eflags_mask = TF_MASK | AC_MASK | ID_MASK | IF_MASK | IOPL_MASK | RF_MASK | NT_MASK;
            if (shift == 0)
                eflags_mask &= 0xffff;
            load_eflags(new_eflags, eflags_mask);
            env->hflags2 &= ~HF2_NMI_MASK;
  4. Inner loop.
                next_tb = 0; /* force lookup of first TB */
                for(;;) {
                    interrupt_request = env->interrupt_request;
                    // Check which kind of interrupt interrupt_request holds and reset interrupt_request.
                    // Set env->exception_index, then jump to cpu_loop_exit.
                    // cpu_loop_exit then longjmps to the setjmp point of the outer loop, into the branch that handles the interrupt.
                    if (unlikely(interrupt_request)) {
                    // Check env->exit_request.
                    if (unlikely(env->exit_request)) {
                        env->exit_request = 0;
                        env->exception_index = EXCP_INTERRUPT;
                    tb = tb_find_fast(env);
                    env->current_tb = tb;
                    // If there is no exit_request, jump into the code cache and start executing.
                    if (likely(!env->exit_request)) {
                    env->current_tb = NULL;
                    /* reset soft MMU for next block (it can currently
                       only be set by a memory fault) */
                } /* for(;;) */
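The outer/inner loop structure described in this section can be reduced to a minimal sketch. It assumes a toy CPU that "faults" after a fixed number of blocks; DemoCPU, demo_exec_block and DEMO_EXCP_HLT are hypothetical names, and only the setjmp/longjmp control flow mirrors what the text describes.

```c
#include <assert.h>
#include <setjmp.h>

#define DEMO_EXCP_HLT 0x10001 /* illustrative exception number */

typedef struct {
    jmp_buf jmp_env;
    int exception_index; /* -1 means nothing is pending */
    int steps_left;      /* hypothetical: blocks to run before faulting */
    int blocks_run;      /* how many blocks completed */
} DemoCPU;

/* Plays the role of cpu_loop_exit: unwind to the setjmp point. */
static void demo_loop_exit(DemoCPU *cpu)
{
    longjmp(cpu->jmp_env, 1);
}

/* Stand-in for executing one translated block. */
static void demo_exec_block(DemoCPU *cpu)
{
    if (cpu->steps_left == 0) {
        cpu->exception_index = DEMO_EXCP_HLT;
        demo_loop_exit(cpu); /* does not return */
    }
    cpu->steps_left--;
    cpu->blocks_run++;
}

static int demo_cpu_exec(DemoCPU *cpu)
{
    assert(cpu->steps_left >= 0);
    cpu->exception_index = -1;
    cpu->blocks_run = 0;
    for (;;) {                              /* outer loop */
        if (setjmp(cpu->jmp_env) == 0) {
            if (cpu->exception_index >= 0) {
                return cpu->exception_index; /* hand the exception to the caller */
            }
            for (;;) {                      /* inner loop */
                demo_exec_block(cpu);
            }
        }
        /* Arrived here via longjmp: go around and handle what is pending. */
    }
}
```

The pending exception is not handled in the branch that longjmp lands in. That branch only falls through to the next outer iteration, where setjmp is armed again and the exception index is examined, which matches the shape of the loop shown above.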


Currently QEMU itself acts as the IO thread and runs main_loop_wait; when it hits block IO, it spawns worker threads to handle it. Every VCPU has a corresponding VCPU thread.
  1. main (vl.c)
    int main(int argc, char **argv, char **envp)
        ... omitted ...
        // qemu_init_main_loop calls main_loop_init (main-loop.c)
        if (qemu_init_main_loop()) {
            fprintf(stderr, "qemu_init_main_loop failed\n");
        ... omitted ...
        /* open the virtual block devices */
        if (snapshot)
            qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, NULL, 0);
        if (qemu_opts_foreach(qemu_find_opts("drive"), drive_init_func, &machine->use_scsi, 1) != 0)
        ... omitted ...
        return 0;
    • qemu_init_cpu_loop (cpus.c)
      void qemu_init_cpu_loop(void)
    • qemu_init_main_loop (vl.c) calls main_loop_init (main-loop.c).
      int main_loop_init(void)
          int ret;
          ret = qemu_signal_init();
          if (ret) {
              return ret;
          /* Note eventfd must be drained before signalfd handlers run */
          ret = qemu_event_init();
          if (ret) {
              return ret;
          return 0;
  2. main_loop (vl.c) is the main execution loop, i.e. the IO thread.
    static void main_loop(void)
        bool nonblocking;
        int last_io = 0;
        do {
            nonblocking = !kvm_enabled() && last_io > 0;
            last_io = main_loop_wait(nonblocking);
        } while (!main_loop_should_exit());
  3. main_loop_wait (main-loop.c) waits for the worker threads to finish their tasks.
    int main_loop_wait(int nonblocking)
        int ret;
        uint32_t timeout = UINT32_MAX;
        if (nonblocking) {
            timeout = 0;
        } else {
        /* poll any events */
        /* XXX: separate device handlers from system ones */
        nfds = -1;
        qemu_iohandler_fill(&nfds, &rfds, &wfds, &xfds);
        // 1. Waits for file descriptors to become readable or writable.
        ret = os_host_main_loop_wait(timeout);
        // The fds are ready; handle IO.
        qemu_iohandler_poll(&rfds, &wfds, &xfds, ret);
        /* Check bottom-halves last in case any of the earlier events triggered
           them.  */
        return ret;
  • qemu_iohandler_poll (main-loop.c).
    void qemu_iohandler_poll(fd_set *readfds, fd_set *writefds, fd_set *xfds, int ret)
        if (ret > 0) {
            IOHandlerRecord *pioh, *ioh;
            QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
                if (!ioh->deleted && ioh->fd_read && FD_ISSET(ioh->fd, readfds)) {
                if (!ioh->deleted && ioh->fd_write && FD_ISSET(ioh->fd, writefds)) {
                /* Do this last in case read/write handlers marked it for deletion */
                if (ioh->deleted) {
                    QLIST_REMOVE(ioh, next);
    • qemu_bh_poll (async.c) processes the bottom halves (bh).
      struct QEMUBH {
          QEMUBHFunc *cb;
          void *opaque;
          QEMUBH *next;
          bool scheduled;
          bool idle;
          bool deleted;
      int qemu_bh_poll(void)
          QEMUBH *bh, **bhp, *next;
          ... omitted ...
          // Devices that need it add their handler to the BH list through qemu_bh_new (async.c) and wait to be invoked.
          // The scheduled bh handlers are invoked here.
          for (bh = first_bh; bh; bh = next) {
              next = bh->next;
              if (!bh->deleted && bh->scheduled) {
                  bh->scheduled = 0;
                  if (!bh->idle)
                      ret = 1;
                  bh->idle = 0;
          ... omitted ...
  • ioport.[ch]: port IO needs no address translation.
  • MMIO needs address translation: env->iotlb
  • DMA uses physical addresses.
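The bottom-half mechanism handled by qemu_bh_poll earlier in this section can be sketched as a singly linked list of deferred callbacks. This is a simplified illustration: the idle flag is dropped, nothing is thread-safe, and every DemoBH/demo_* name is hypothetical rather than QEMU's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef void DemoBHFunc(void *opaque);

typedef struct DemoBH {
    DemoBHFunc *cb;
    void *opaque;
    struct DemoBH *next;
    bool scheduled;
    bool deleted;
} DemoBH;

static DemoBH *demo_first_bh;

/* Example callback: bump the int that opaque points to. */
static void demo_count_cb(void *opaque)
{
    (*(int *)opaque)++;
}

/* Register a handler; it stays dormant until scheduled. */
static DemoBH *demo_bh_new(DemoBHFunc *cb, void *opaque)
{
    DemoBH *bh;
    assert(cb != NULL);
    bh = calloc(1, sizeof(*bh));
    if (!bh) {
        return NULL;
    }
    bh->cb = cb;
    bh->opaque = opaque;
    bh->next = demo_first_bh;
    demo_first_bh = bh;
    return bh;
}

static void demo_bh_schedule(DemoBH *bh) { bh->scheduled = true; }

/* Deletion is deferred: the node is freed by the next poll. */
static void demo_bh_delete(DemoBH *bh)
{
    bh->scheduled = false;
    bh->deleted = true;
}

/* Run every scheduled handler once, then reap deleted nodes.
 * Returns 1 if at least one handler ran. */
static int demo_bh_poll(void)
{
    DemoBH *bh, **bhp, *next;
    int ret = 0;

    for (bh = demo_first_bh; bh; bh = next) {
        next = bh->next;
        if (!bh->deleted && bh->scheduled) {
            bh->scheduled = false; /* clear first so cb may reschedule */
            ret = 1;
            bh->cb(bh->opaque);
        }
    }
    bhp = &demo_first_bh;
    while (*bhp) {
        bh = *bhp;
        if (bh->deleted) {
            *bhp = bh->next;
            free(bh);
        } else {
            bhp = &bh->next;
        }
    }
    return ret;
}
```

Deferring the free to the poll loop lets a handler delete itself or another bottom half while the list is being walked, which is the reason for the deleted flag in this sketch.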


QEMUTimer and
struct QEMUTimer {
    QEMUClock *clock; // Keeps time using a specific QEMUClock
    int64_t expire_time;
    QEMUTimerCB *cb; // callback function pointer
    void *opaque; // Argument passed to the callback function
    struct QEMUTimer *next;
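QEMUClock comes in the following kinds; see qemu-timer.h:
  1. rt_clock: only things that do not change the virtual machine may use rt_clock, because rt_clock keeps running even while the virtual machine is stopped.
  2. vm_clock: vm_clock runs only while the virtual machine is running.
  3. host_clock: used by virtual devices that produce a real time source.
rtc_clock selects one of the clocks above.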
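The QEMUTimer fields above (expire_time, cb, opaque, next) suggest a list kept sorted by expiry time. The following is a minimal sketch under that assumption, with a single clock whose current time is passed in by the caller; DemoTimer, demo_timer_mod and demo_run_timers are hypothetical names, not QEMU functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void DemoTimerCB(void *opaque);

typedef struct DemoTimer {
    int64_t expire_time;
    DemoTimerCB *cb;   /* callback function pointer */
    void *opaque;      /* argument passed to the callback */
    struct DemoTimer *next;
} DemoTimer;

static DemoTimer *demo_active_timers; /* sorted by expire_time, earliest first */

/* Example callback: record the id that opaque points to. */
static int demo_log[8];
static int demo_log_len;
static void demo_record_cb(void *opaque)
{
    assert(demo_log_len < 8);
    demo_log[demo_log_len++] = *(int *)opaque;
}

/* Arm (or re-arm) a timer, keeping the list sorted. */
static void demo_timer_mod(DemoTimer *t, int64_t expire_time)
{
    DemoTimer **pt;
    assert(t->cb != NULL);
    /* unlink first if the timer is already pending */
    for (pt = &demo_active_timers; *pt; pt = &(*pt)->next) {
        if (*pt == t) {
            *pt = t->next;
            break;
        }
    }
    t->expire_time = expire_time;
    /* skip entries that expire no later than the new one */
    pt = &demo_active_timers;
    while (*pt && (*pt)->expire_time <= expire_time) {
        pt = &(*pt)->next;
    }
    t->next = *pt;
    *pt = t;
}

/* Fire every timer that has expired at time now; return how many fired. */
static int demo_run_timers(int64_t now)
{
    int fired = 0;
    while (demo_active_timers && demo_active_timers->expire_time <= now) {
        DemoTimer *t = demo_active_timers;
        demo_active_timers = t->next; /* unlink before calling so cb may re-arm */
        t->next = NULL;
        t->cb(t->opaque);
        fired++;
    }
    return fired;
}
```

Keeping the list sorted means the run loop only ever inspects the head. In this model the difference between the clock kinds would come down to which notion of "now" is passed in, for example one that stops advancing while the guest is stopped.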


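See cpu-all.h. There are basically four classes of generic interrupts:
  1. CPU_INTERRUPT_HARD: an interrupt raised by a virtual peripheral.
  2. CPU_INTERRUPT_EXITTB: used when some peripheral changes its memory mapping, e.g. an A20 line change. Asks the virtual CPU to leave the current TB.
In addition, CPU_INTERRUPT_TGT_EXT_* and CPU_INTERRUPT_TGT_INT_* are left for each CPU to use as it sees fit. See, for example, target-i386/cpu.h.
  • hw/pc.c: common PC peripherals.
  • hw/irq.*: interrupt support.
  • hw/apic.c: emulates the APIC, which raises interrupts (cpu_interrupt).
  • hw/i8259.c: emulates the PIC.
  • hw/i8254.c: emulates the timer.
QOM (QEMU Object Model) is meant to replace QDev 13)
IRQs raised by virtual peripherals are wrapped in IRQState. In QEMU, every device, including buses, bridges and ordinary devices, has a corresponding device structure. A bus, such as the PCI bus, is wrapped as PCIBus; a bridge, such as a PCI bridge, is wrapped as PCIBridge (see also PCIDeviceInfo).
  • pc_init_pci (hw/pc_piix.c) calls pc_init1 (hw/pc_piix.c) to initialize the PC machine.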
    /* PC hardware initialisation */
    static void pc_init1(ram_addr_t ram_size, ...)
        /* Initialize the CPUs */
        /* Initialize memory */
        /* Initialize the PIC */
        if (!xen_enabled()) {
            cpu_irq = pc_allocate_cpu_irq();
            i8259 = i8259_init(cpu_irq[0]);
        } else {
            i8259 = xen_interrupt_controller_init();
        /* Initialize ISA */
        isa_irq_state = qemu_mallocz(sizeof(*isa_irq_state));
        isa_irq_state->i8259 = i8259;
        /* Initialize the IOAPIC */
        if (pci_enabled) {
            ioapic_init(isa_irq_state); // sysbus_get_default creates main-system-bus
        /* Initialize the PCI bus; peripherals can be attached to it afterwards */
        if (pci_enabled) {
            pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, isa_irq, ram_size);
        } else {
            pci_bus = NULL;
            i440fx_state = NULL;
        /* Initialize the other peripherals */
    • i8259 (PIC): see PicState2 and PicState. See also 8259 PIC and Intel 8259.
      qemu_irq *i8259_init(qemu_irq parent_irq)
          PicState2 *s;
          s = qemu_mallocz(sizeof(PicState2));
          pic_init1(0x20, 0x4d0, &s->pics[0]); // The master IO port is 0x20
          pic_init1(0xa0, 0x4d1, &s->pics[1]); // The slave IO port is 0xa0
          s->pics[0].elcr_mask = 0xf8;
          s->pics[1].elcr_mask = 0xde;
          s->parent_irq = parent_irq;
          s->pics[0].pics_state = s;
          s->pics[1].pics_state = s;
          isa_pic = s;
          return qemu_allocate_irqs(i8259_set_irq, s, 16);
    • i440fx_init → i440fx_common_init. See Intel 440FX and PCI IDE ISA Xcelerator.
      static PCIBus *i440fx_common_init(const char *device_name, ...)
          DeviceState *dev;
          PCIBus *b;
          PCIDevice *d;
          I440FXState *s; // North bridge
          PIIX3State *piix3; // South bridge (PCI-ISA)
          /* Create the PCI host bus device */
          dev = qdev_create(NULL, "i440FX-pcihost");
          s = FROM_SYSBUS(I440FXState, sysbus_from_qdev(dev)); // See hw/sysbus.h and osdep.h
          /* Create the actual PCI bus */
          b = pci_bus_new(&s->busdev.qdev, NULL, 0);
          s->bus = b;
          /* Initialize the host bus device */
          /* Create the host bridge */
          d = pci_create_simple(b, 0, device_name);
          *pi440fx_state = DO_UPCAST(PCII440FXState, dev, d);
          /* Create the ISA bridge (south bridge) */
              piix3 = DO_UPCAST(PIIX3State, dev,
                      pci_create_simple_multifunction(b, -1, true, "PIIX3"));
              pci_bus_irqs(b, piix3_set_irq, pci_slot_get_pirq, piix3,
          /* Connect the 8259 interrupt controller; the IOAPIC appears to be tied in here as well */
          piix3->pic = pic;
          (*pi440fx_state)->piix3 = piix3;
          *piix3_devfn = piix3->dev.devfn;
          ram_size = ram_size / 8 / 1024 / 1024;
          if (ram_size > 255)
              ram_size = 255;
          return b; /* From here on, peripherals can be attached to this PCI bus */
Taking i8259 as an example:
static void i8259_set_irq(void *opaque, int irq, int level)
    pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
  1. Finally, apic_local_deliver (Local APIC) calls cpu_interrupt to deliver the interrupt to the virtual CPU.
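The routing expression in i8259_set_irq, pics[irq >> 3] with line irq & 7, can be tried out in isolation. The sketch below assumes a heavily reduced PIC that only tracks a request bitmap and treats every line as level-triggered; DemoPic, DemoPicPair and the demo_* functions are hypothetical and omit masking, priorities and the cascade to the CPU.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t irr; /* interrupt request register: one bit per input line */
} DemoPic;

typedef struct {
    DemoPic pics[2]; /* [0] = master (IRQ 0-7), [1] = slave (IRQ 8-15) */
} DemoPicPair;

/* Raise or lower one of the eight input lines of a single controller. */
static void demo_pic_set_irq1(DemoPic *s, int line, int level)
{
    uint8_t mask = (uint8_t)(1u << line);
    if (level) {
        s->irr |= mask;
    } else {
        s->irr &= (uint8_t)~mask;
    }
}

/* Route a global IRQ number 0..15 to the right controller and line. */
static void demo_i8259_set_irq(DemoPicPair *s, int irq, int level)
{
    assert(irq >= 0 && irq < 16);
    demo_pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
}
```

With sixteen IRQs split across two eight-input controllers, the high bit of the 4-bit IRQ number selects the controller and the low three bits select the input, so IRQ 14 lands on line 6 of the slave.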





  1. watch_mem_{read, write}
    static uint64_t watch_mem_read(void *opaque, target_phys_addr_t addr,
                                   unsigned size)
        check_watchpoint(addr & ~TARGET_PAGE_MASK, ~(size - 1), BP_MEM_READ);
        switch (size) {
        case 1: return ldub_phys(addr);
        case 2: return lduw_phys(addr);
        case 4: return ldl_phys(addr);
        default: abort();
    static const MemoryRegionOps watch_mem_ops = {
        .read = watch_mem_read,
        .write = watch_mem_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
    • cpu_watchpoint_insert inserts a watchpoint.
    • qemu_add_vm_change_state_handler registers a function to be called whenever the QEMU VM state changes.
  2. io_mem_init
    static void io_mem_init(void)
        memory_region_init_io(&io_mem_ram, &error_mem_ops, NULL, "ram", UINT64_MAX);
        memory_region_init_io(&io_mem_rom, &rom_mem_ops, NULL, "rom", UINT64_MAX);
        memory_region_init_io(&io_mem_unassigned, &unassigned_mem_ops, NULL,
                              "unassigned", UINT64_MAX);
        memory_region_init_io(&io_mem_notdirty, &notdirty_mem_ops, NULL,
                              "notdirty", UINT64_MAX);
        memory_region_init_io(&io_mem_subpage_ram, &subpage_ram_ops, NULL,
                              "subpage-ram", UINT64_MAX);
        memory_region_init_io(&io_mem_watch, &watch_mem_ops, NULL,
                              "watch", UINT64_MAX);
  3. check_watchpoint
    /* Generate a debug exception if a watchpoint has been hit.  */
    static void check_watchpoint(int offset, int len_mask, int flags)
        ... (omitted) ...
        // When check_watchpoint is first called from watch_mem_read, env->watchpoint_hit is NULL.
        if (env->watchpoint_hit) {
            /* We re-entered the check after replacing the TB. Now raise
             * the debug interrupt so that it will trigger after the
             * current instruction. */
            // Re-executing the instruction that hit the watchpoint brings us here.
            // env->interrupt_request is set to CPU_INTERRUPT_DEBUG, then control returns to cpu_exec. (1.b)
            cpu_interrupt(env, CPU_INTERRUPT_DEBUG);
        vaddr = (env->mem_io_vaddr & TARGET_PAGE_MASK) + offset;
        // Check whether the memory address being accessed is watched.
        QTAILQ_FOREACH(wp, &env->watchpoints, entry) {
            if ((vaddr == (wp->vaddr & len_mask) ||
                 (vaddr & wp->len_mask) == wp->vaddr) && (wp->flags & flags)) {
                wp->flags |= BP_WATCHPOINT_HIT;
                // On the first entry into check_watchpoint, env->watchpoint_hit is NULL.
                if (!env->watchpoint_hit) {
                    env->watchpoint_hit = wp;
                    tb = tb_find_pc(env->mem_io_pc);
                    if (!tb) {
                        cpu_abort(env, "check_watchpoint: could not find TB for "
                                  "pc=%p", (void *)env->mem_io_pc);
                    cpu_restore_state(tb, env, env->mem_io_pc);
                    tb_phys_invalidate(tb, -1);
                    if (wp->flags & BP_STOP_BEFORE_ACCESS) {
                        env->exception_index = EXCP_DEBUG;
                    } else {
                        // Retranslate the TB that hit the watchpoint, starting from the instruction that triggered it.
                        // Return to cpu_exec and resume execution from that instruction. (1.a)
                        cpu_get_tb_cpu_state(env, &pc, &cs_base, &cpu_flags);
                        tb_gen_code(env, pc, cs_base, cpu_flags, 1);
                        cpu_resume_from_signal(env, NULL);
            } else {
                wp->flags &= ~BP_WATCHPOINT_HIT;
  4. cpu_exec
    int cpu_exec(CPUArchState *env)
        ... (omitted) ...
        for(;;) {
            if (setjmp(env->jmp_env) == 0) {
                /* if an exception is pending, we execute it here */
                if (env->exception_index >= 0) {
                    if (env->exception_index >= EXCP_INTERRUPT) {
                        /* exit request from the cpu execution loop */
                        ret = env->exception_index;
                        if (ret == EXCP_DEBUG) {
                            cpu_handle_debug_exception(env); // handle the watchpoint. (1.b)
                    } else {
                        env->exception_index = -1;
                next_tb = 0; /* force lookup of first TB */
                for(;;) {
                    interrupt_request = env->interrupt_request;
                    if (unlikely(interrupt_request)) {
                        ... (omitted) ...
                        if (interrupt_request & CPU_INTERRUPT_DEBUG) {
                            env->interrupt_request &= ~CPU_INTERRUPT_DEBUG;
                            env->exception_index = EXCP_DEBUG;
                            cpu_loop_exit(env); // return to the outer loop of cpu_exec. (1.a)
                        ... (omitted) ...
        ... (omitted) ...
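The address test inside the QTAILQ_FOREACH loop of check_watchpoint can be studied on its own. The sketch below reproduces only that comparison, assuming (as in the excerpt) that each length mask is `~(len - 1)` for a power-of-two length. `DemoWatchpoint` and `demo_wp_matches` are hypothetical names for illustration; the flags check and the TB handling are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced watchpoint record: only the fields used by the
 * address comparison. */
typedef struct {
    uint64_t vaddr;    /* watched address, aligned to the watched length */
    uint64_t len_mask; /* ~(watched_len - 1) */
} DemoWatchpoint;

/* Mirrors the two-sided test in the excerpt:
 *  - the access is wider than the watchpoint and covers it, or
 *  - the access falls inside the watched range. */
static bool demo_wp_matches(const DemoWatchpoint *wp, uint64_t vaddr,
                            uint64_t access_len_mask)
{
    return vaddr == (wp->vaddr & access_len_mask) ||
           (vaddr & wp->len_mask) == wp->vaddr;
}
```

With a 4-byte watchpoint at 0x1000, a 1-byte access at 0x1002 matches through the second clause. With a 1-byte watchpoint at 0x1002, a 4-byte access at 0x1000 matches through the first.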
static void notdirty_mem_write(void *opaque, target_phys_addr_t ram_addr,
                               uint64_t val, unsigned size)
    int dirty_flags;
    dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
    if (!(dirty_flags & CODE_DIRTY_FLAG)) {
#if !defined(CONFIG_USER_ONLY)
        tb_invalidate_phys_page_fast(ram_addr, size);
        dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
    switch (size) {
    case 1:
        stb_p(qemu_get_ram_ptr(ram_addr), val); // ram_addr is a GPA; qemu_get_ram_ptr converts it to the corresponding HVA.
    case 2:
        stw_p(qemu_get_ram_ptr(ram_addr), val);
    case 4:
        stl_p(qemu_get_ram_ptr(ram_addr), val);
    dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
    cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
    /* we remove the notdirty callback only if the code has been
       flushed */
    if (dirty_flags == 0xff)
        tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
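The flag arithmetic at the end of notdirty_mem_write decides whether the slow-path callback can be dropped for the page. The sketch below models only that arithmetic. The value 0x02 for the code-dirty bit is an assumption made for the example, the `demo_` names are hypothetical, and the `code_fully_flushed` parameter stands in for the effect of tb_invalidate_phys_page_fast on the page's flags.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CODE_DIRTY_FLAG 0x02 /* assumed bit position, for illustration */

/* Returns the page's dirty flags after a write through the notdirty path.
 * code_fully_flushed models the case where invalidating the translated
 * code for this write left no TB on the page. */
static uint8_t demo_notdirty_update(uint8_t flags, bool code_fully_flushed)
{
    if (!(flags & DEMO_CODE_DIRTY_FLAG) && code_fully_flushed) {
        flags |= DEMO_CODE_DIRTY_FLAG;
    }
    /* Every dirty bit except the code bit is set unconditionally. */
    flags |= (0xff & ~DEMO_CODE_DIRTY_FLAG);
    return flags;
}

/* The callback is removed only once every bit, including the code bit,
 * is set. */
static bool demo_can_drop_callback(uint8_t flags)
{
    return flags == 0xff;
}
```

As long as translated code remains on the page, the code bit stays clear, the result is never 0xff, and writes keep going through the callback so that stale TBs can be invalidated.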
  1. stl_phys_notdirty (exec.c) writes a PTE.
    void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val)
        uint8_t *ptr;
        MemoryRegionSection *section;
        section = phys_page_find(addr >> TARGET_PAGE_BITS);
        if (!memory_region_is_ram(section->mr) || section->readonly) {
            addr = memory_region_section_addr(section, addr);
            if (memory_region_is_ram(section->mr)) {
                section = &phys_sections[phys_section_rom];
            io_mem_write(section->mr, addr, val, 4);
        } else {
            unsigned long addr1 = (memory_region_get_ram_addr(section->mr)
                                   & TARGET_PAGE_MASK)
                + memory_region_section_addr(section, addr);
            ptr = qemu_get_ram_ptr(addr1);
            stl_p(ptr, val);
            if (unlikely(in_migration)) {
                if (!cpu_physical_memory_is_dirty(addr1)) {
                    /* invalidate code */
                    tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
                    /* set dirty bit */
                        addr1, (0xff & ~CODE_DIRTY_FLAG));

GDB Stub

  • gdb_handle_packet handles the requests sent by the client.
    static int gdb_handle_packet(GDBState *s, const char *line_buf)


  • linux-user/main.c - qemu user main
  • exec.c - virtual page mapping and translated block handling
  • cpu-exec.c - i386 emulator main execution loop
  • target-i386/translate.c - i386 translation
  • tcg/tcg.c - Tiny Code Generator for QEMU
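    • tcg_out_xxx writes its arguments (host binary) into the code cache pointed to by TCGContext.
  • tcg/tcg-op.h - provides the functions that generate TCG IR; opcodes are written to the buffer pointed to by gen_opc_ptr (gen_opc_buf in translate-all.c), and operands are written to the buffer pointed to by gen_opparam_ptr.
    • tcg_gen_xxx generates TCG IR.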
  • target-i386/op_helper.c - Code snippets called from TCG generated code. Implements more complex operations that gcc handles better than TCG.
  • target-i386/helper.c - Helper functions specific to the CPU, but called from multiple places around QEMU. For example the MMU code belongs here.
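  • target-i386/cpu.h includes cpu-defs.h, which defines the data structures shared across targets.
  • cpu_x86_exec/cpu_exec means that the call is made to cpu_x86_exec, while the function that actually runs is cpu_exec.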
    // target-i386/cpu.h
    #define cpu_exec cpu_x86_exec
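The effect of this macro can be reproduced in isolation. In the sketch below the function body is a placeholder rather than QEMU's execution loop, and `run_guest` is a hypothetical caller; only the renaming is the point.

```c
/* Same renaming trick as in target-i386/cpu.h: every later use of the
 * identifier cpu_exec is rewritten to cpu_x86_exec by the preprocessor. */
#define cpu_exec cpu_x86_exec

/* Written as cpu_exec in the source, but the symbol the compiler emits
 * is cpu_x86_exec. */
int cpu_exec(int steps)
{
    return steps + 1; /* placeholder body */
}

#undef cpu_exec

/* A caller that names the target-specific symbol directly. */
int run_guest(int steps)
{
    return cpu_x86_exec(steps);
}
```

This lets one generic source file define a differently named entry point per target, depending on which target header it is compiled against.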
  1. When an interrupt occurs, translation blocks must be unchained.
    • cpu_interrupt (exec.c) → cpu_unlink_tb (exec.c)
  2. Each ISA has its own interrupt handling function. See do_interrupt_xxx in target-xxx/*.


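      • exception_index holds the interrupt number. Both hardware and syscalls assign to exception_index. QEMU uses setjmp/longjmp to handle interrupts; jmp_env stores the saved context.
      • interrupt_request
      • exit_request. The global variable exit_request is assigned by cpu_signal only when the IO thread is enabled.
      • CPUState *env is kept in a dedicated host register to speed up access to CPUState. See exec.h.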
        register struct CPUX86State *env asm(AREG0);
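      • CPUState has been rewritten using QOM. See [Qemu-devel] [PULL] QOM CPUState for i386.
    • Records the information a TB needs. tc_ptr points into the translated code cache. A TB is associated with guest physical pages.
      • page_addr records the pages the TB belongs to.
      • page_next
    • Used to look up the TBs in a page.
    • Structure used to disassemble the guest binary.
    • Structure used to store TCG IR.
    • static_temps stores intermediate values produced during computation.
    • IOHandler is a callback function.
    • Segment register cache. When x86 loads the selectors of TR and IDTR, it also loads the segment base address, segment limit and descriptor attributes. cpu_x86_load_seg_cache (target-i386/cpu.h) fills in SegmentCache.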
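The exception_index and jmp_env fields described above work together: code deep inside the execution loop records a reason in exception_index and then longjmps back to the setjmp at the top of cpu_exec. The following is a minimal sketch of that control flow only. `DemoCPU`, `demo_cpu_loop_exit`, `demo_cpu_exec` and the value of `DEMO_EXCP_DEBUG` are hypothetical and chosen for illustration; no translation or real interrupt handling is modeled.

```c
#include <setjmp.h>

#define DEMO_EXCP_DEBUG 2 /* arbitrary value chosen for this sketch */

typedef struct {
    jmp_buf jmp_env;     /* context saved at the top of the loop */
    int exception_index; /* -1 means nothing is pending */
} DemoCPU;

/* Unwind to the setjmp in demo_cpu_exec, as cpu_loop_exit does. */
static void demo_cpu_loop_exit(DemoCPU *env)
{
    longjmp(env->jmp_env, 1);
}

/* Runs placeholder "instructions" until trap_after of them have executed,
 * then raises a debug exception and returns its number. */
static int demo_cpu_exec(DemoCPU *env, int trap_after)
{
    volatile int executed = 0; /* volatile: modified between setjmp and longjmp */

    env->exception_index = -1;
    for (;;) {
        if (setjmp(env->jmp_env) == 0) {
            /* A pending exception is handled at the top of the outer loop. */
            if (env->exception_index >= 0) {
                return env->exception_index;
            }
            for (;;) {
                if (executed == trap_after) {
                    env->exception_index = DEMO_EXCP_DEBUG;
                    demo_cpu_loop_exit(env); /* does not return */
                }
                executed++; /* stands in for running one TB */
            }
        }
        /* longjmp lands here; go around the outer loop again. */
    }
}
```

The second pass through the outer loop sees the pending exception_index and returns it, which is the same shape as the EXCP_DEBUG path in the cpu_exec excerpt.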


  1. in_asm is the disassembly of the guest code. out_asm is the assembly generated for the host machine. op is QEMU's own IR.
    # Edit #define DEBUG_LOGFILE "/tmp/qemu.log" in linux-user/main.c
    # Edit logfilename in exec.c
    $ qemu-i386 -d in_asm,out_asm,op hello
    $ less /tmp/qemu.log
  2. Find out where qemu writes log output.
    $ grep -r qemu_log qemu-0.14.2/*
  3. Add extra log options.
    1. Add the following to cpu-all.h
      #define CPU_LOG_IBTC       (1 << 10)
      #define CPU_LOG_TB_FIND    (1 << 11)
    2. Register the new option so that it is printed in the option list.
      const CPULogItem cpu_log_items[] = {
         { CPU_LOG_IBTC, "ibtc",
           "print ibtc return address" },
    3. Use the new option in the code.
      #ifdef DEBUG_DISAS
        if (qemu_loglevel_mask(CPU_LOG_IBTC)) {
            qemu_log("%lu\t%p\n", guest_eip, next_tb->tc_ptr);
  4. Define your own helper function. For example, to add a helper function for the i386 guest:
    1. target-i386/helper.h
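      // For the parameters passed to a helper function see def-helper.h; also see tcg/README.
      // In DEF_HELPER_FLAGS_?, the ? is the number of arguments. The macro parameters are, in order: the helper function's name,
      // the qualifier (TCG_CALL_CONST indicates whether the helper function modifies global variables), the return type, and the argument types.
      // Corresponding function: void *helper_lookup_ibtc(target_ulong guest_eip, CPUState *env)
      // gen_helper_lookup_ibtc(ibtc_host_eip, cpu_T[0]) can be used to emit the TCG IR that calls this helper function.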
      DEF_HELPER_FLAGS_2(lookup_ibtc, TCG_CALL_CONST, ptr, tl, env)
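  5. Because QEMU treats entering and leaving the code cache as a function call, it translates a guest call instruction into a jmp instruction. Otherwise, a longjmp (back to QEMU) from inside the code cache would corrupt the call stack.
  6. Modify monitor.[ch] to add features to the QEMU Monitor.
  7. Check $BUILD/config-host.mak to see the configuration parameters.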
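Step 3 above adds a bit to a log mask and a matching entry to a name table. A self-contained sketch of how such a table can turn a comma-separated `-d` argument into a mask is given below. It is not QEMU's parser: `DemoLogItem`, `demo_log_items` and `demo_str_to_log_mask` are hypothetical names, the help text for tb_find is invented for the example, and only the two bit values are taken from step 3.1.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_LOG_IBTC    (1 << 10)
#define DEMO_LOG_TB_FIND (1 << 11)

typedef struct {
    int mask;
    const char *name;
    const char *help;
} DemoLogItem;

/* Name table terminated by a zero mask. */
static const DemoLogItem demo_log_items[] = {
    { DEMO_LOG_IBTC, "ibtc", "print ibtc return address" },
    { DEMO_LOG_TB_FIND, "tb_find", "print TB lookups" },
    { 0, NULL, NULL },
};

/* Converts "ibtc,tb_find" into a bit mask. Returns 0 if any name is
 * unknown or empty. */
static int demo_str_to_log_mask(const char *str)
{
    int mask = 0;
    const char *p = str;

    while (*p) {
        size_t n = strcspn(p, ","); /* length of the current name */
        const DemoLogItem *item;

        for (item = demo_log_items; item->mask != 0; item++) {
            if (strlen(item->name) == n && memcmp(item->name, p, n) == 0) {
                mask |= item->mask;
                break;
            }
        }
        if (item->mask == 0) {
            return 0; /* no table entry matched */
        }
        p += n;
        if (*p == ',') {
            p++;
        }
    }
    return mask;
}
```

Code that wants to log conditionally then tests the resulting mask against its own bit, which is the role qemu_loglevel_mask(CPU_LOG_IBTC) plays in step 3.3.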


  • Makefile.
    # configure contains: echo "TARGET_DIRS=$target_list" >> $config_host_mak
    # $target_list is the i386-linux-user given in --target-list=i386-linux-user.
    include config-host.mak
    # Rewrite each entry of $(TARGET_DIRS) from % to subdir-%.
    SUBDIR_RULES=$(patsubst %,subdir-%, $(TARGET_DIRS))
    # $(GENERATED_HEADERS) are header files that are generated at build time.
    subdir-%: $(GENERATED_HEADERS)
      $(call quiet-command,$(MAKE) $(SUBDIR_MAKEFLAGS) -C $* V="$(V)" TARGET_DIR="$*/" all,)
    ifneq ($(wildcard config-host.mak),)
    include $(SRC_PATH)/Makefile.objs
    $(universal-obj-y) $(common-obj-y): $(GENERATED_HEADERS)
    subdir-libcacard: $(oslib-obj-y) $(trace-obj-y) qemu-timer-common.o
    # Select the entries matching %-softmmu from $(SUBDIR_RULES); % matches a string of any length.
    $(filter %-softmmu,$(SUBDIR_RULES)): $(universal-obj-y) $(trace-obj-y) $(common-obj-y) subdir-libdis
    $(filter %-user,$(SUBDIR_RULES)): $(GENERATED_HEADERS) $(universal-obj-y) $(trace-obj-y) subdir-libdis-user subdir-libuser
  • Makefile.target. QEMU_PROG is the executable that is finally produced. Builds are normally done under a $BUILD directory, kept separate from the $SRC directory.
    # Linux user emulator target
    # call passes the argument, here $(SRC_PATH)/linux-user:$(SRC_PATH)/linux-user/$(TARGET_ABI_DIR),
    # to the expression set-vpath, which is defined in rules.mak.
    $(call set-vpath, $(SRC_PATH)/linux-user:$(SRC_PATH)/linux-user/$(TARGET_ABI_DIR))
    # With --target-list=i386-linux-user, TARGET_I386 is set to y, so this line becomes obj-y += vm86.o.
    # You can append your own *.o files to obj-y.
    obj-$(TARGET_I386) += vm86.o
  • rules.mak
    # Build *.o from *.c. $@ is the *.o file to be produced; $< is the input file, here the *.c file.
    # A condition could be added here to have clang produce LLVM *.bc files.
    %.o: %.c
      $(call quiet-command,$(CC) $(QEMU_INCLUDES) $(QEMU_CFLAGS) $(QEMU_DGFLAGS) $(CFLAGS) -c -o $@ $<,"  CC    $(TARGET_DIR)$@")
    # V is the V in `make V=1`. When it is set, the commands being run are printed to the screen; otherwise the @ prefix keeps them from being shown.
    # $1 is $(CC) $(QEMU_INCLUDES) $(QEMU_CFLAGS) $(QEMU_DGFLAGS) $(CFLAGS) -c -o $@ $<
    # $2 is "  CC    $(TARGET_DIR)$@"
    quiet-command = $(if $(V),$1,$(if $(2),@echo $2 && $1, @$1))
    VPATH_SUFFIXES = %.c %.h %.S %.m %.mak %.texi %.sh
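    # set-vpath sets up the search path. VPATH is a variable; vpath is a directive.
    # VPATH tells make which additional paths to search when a file cannot be found under the $BUILD directory.
    # foreach binds each entry of $(VPATH_SUFFIXES) to PATTERN in turn and then runs $(eval vpath $(PATTERN) $1).
    # vpath specifies the search directory $1 for files matching $(PATTERN); here $1 is $(SRC_PATH)/linux-user:$(SRC_PATH)/linux-user/$(TARGET_ABI_DIR).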
    set-vpath = $(if $1,$(foreach PATTERN,$(VPATH_SUFFIXES),$(eval vpath $(PATTERN) $1)))