
Monday, August 15, 2011

e1000e scales a lot better than bnx2

At StumbleUpon we've had a never-ending string of problems with Broadcom's cards that use the bnx2 driver. The machine cannot handle more than 100kpps (packets per second), and the driver has bugs that will lock up the NIC until it gets reset manually when you use jumbo frames and/or TSO (TCP Segmentation Offloading).

So we switched everything to Intel NICs. Not only do they not have these nasty bugs, they also scale better: they can do up to 170kpps each way before they start discarding packets. Graphs courtesy of OpenTSDB:
Packets/s vs. packets dropped/s
Packets/s vs. interrupts/s


We can also see how the NIC is doing interrupt coalescing at high packet rates. Yay.
Kernel tested: 2.6.32-31-server x86_64 from Lucid, running on two L5630s with 48GB of RAM.

Sunday, November 14, 2010

How long does it take to make a context switch?

That's an interesting question I'm willing to spend some of my time on. Someone at StumbleUpon put forward the hypothesis that with all the improvements in the Nehalem architecture (marketed as Intel i7), context switching would be much faster. How would you devise a test to empirically find an answer to this question? How expensive are context switches anyway? (tl;dr answer: very expensive)

The lineup

April 21, 2011 update: I added an "extreme" Nehalem and a low-voltage Westmere.
April 1, 2013 update: Added an Intel Sandy Bridge E5-2620.
I've put 4 different generations of CPUs to test:
  • A dual Intel 5150 (Woodcrest, based on the old "Core" architecture, 2.67GHz). The 5150 is a dual-core, and so in total the machine has 4 cores available. Kernel: 2.6.28-19-server x86_64.
  • A dual Intel E5440 (Harpertown, based on the Penryn architecture, 2.83GHz). The E5440 is a quad-core so the machine has a total of 8 cores. Kernel: 2.6.24-26-server x86_64.
  • A dual Intel E5520 (Gainestown, based on the Nehalem architecture, aka i7, 2.27GHz). The E5520 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Kernel: 2.6.28-18-generic x86_64.
  • A dual Intel X5550 (Gainestown, based on the Nehalem architecture, aka i7, 2.67GHz). The X5550 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the X5550 is in the "server" product line. This CPU is 3x more expensive than the previous one. Kernel: 2.6.28-15-server x86_64.
  • A dual Intel L5630 (Gulftown, based on the Westmere architecture, aka i7, 2.13GHz). The L5630 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the L5630 is a "low-voltage" CPU. At equal price, this CPU is in theory 16% less powerful than a non-low-voltage CPU. Kernel: 2.6.32-29-server x86_64.
  • A dual Intel E5-2620 (Sandy Bridge-EP, based on the Sandy Bridge architecture, aka E5, 2GHz). The E5-2620 is a hexa-core, has HyperThreading, so the machine has a total of 12 cores, or 24 "hardware threads". Kernel: 3.4.24 x86_64.
As far as I can tell, all CPUs are set to a constant clock rate (no Turbo Boost or anything fancy). All the Linux kernels are those built and distributed by Ubuntu.

First idea: with syscalls (fail)

My first idea was to make a cheap system call many times in a row, time how long it took, and compute the average time spent per syscall. The cheapest system call on Linux these days seems to be gettid. Turns out this was a naive approach, since nowadays system calls don't actually cause a full context switch: the kernel can get away with a "mode switch" (go from user mode to kernel mode, then back to user mode). That's why when I ran my first test program, vmstat wouldn't show a noticeable increase in the number of context switches. But this test is interesting too, even though it's not what I wanted originally.
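The actual timesyscall.c is linked below; the idea is simple enough to sketch here. This is my own approximation, not the original source: the use of gettid via syscall(), the iteration count and the gettimeofday-based timing are all assumptions.

/* Rough sketch of the syscall-timing idea (not the original timesyscall.c). */
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>

#define N 10000000

int main(void) {
  struct timeval start, end;
  long usec;
  int i;

  gettimeofday(&start, NULL);
  for (i = 0; i < N; i++)
    syscall(SYS_gettid);            /* cheapest syscall: just returns a number */
  gettimeofday(&end, NULL);

  usec = (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
  printf("%.1f ns/syscall\n", usec * 1000.0 / N);
  return 0;
}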

Source code: timesyscall.c Results:
  • Intel 5150: 105ns/syscall
  • Intel E5440: 87ns/syscall
  • Intel E5520: 58ns/syscall
  • Intel X5550: 52ns/syscall
  • Intel L5630: 58ns/syscall
  • Intel E5-2620: 67ns/syscall
Now that's nice: more expensive CPUs perform noticeably better (note, however, the slight increase in cost on Sandy Bridge). But that's not really what we wanted to know. So to test the cost of a context switch, we need to force the kernel to de-schedule the current process and schedule another one instead. And to benchmark the CPU, we need to get the kernel to do nothing but this in a tight loop. How would you do this?

Second idea: with futex

The way I did it was to abuse futex (RTFM). futex is the low-level Linux-specific primitive used by most threading libraries to implement blocking operations such as waiting on contended mutexes, semaphores that run out of permits, condition variables, and so on. If you would like to know more, go read Futexes Are Tricky by Ulrich Drepper. Anyway, with a futex it's easy to suspend and resume processes. What my test does is fork off a child process, and then the parent and the child take turns waiting on the futex. When the parent waits, the child wakes it up and goes on to wait on the futex, until the parent wakes it and goes on to wait again. It's a kind of ping-pong: "I wake you up, you wake me up...".
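timectxsw.c is linked below; here is a stripped-down sketch of the ping-pong protocol to make it concrete. This is my own reconstruction, not the original: the single shared futex word, the iteration count and the timing code are assumptions.

/* Minimal sketch of the futex ping-pong (not the original timectxsw.c). */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

#define ITERATIONS 500000

/* FUTEX_WAIT sleeps only if *addr is still equal to val (checked atomically). */
static void futex_wait(volatile int *addr, int val) {
  syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}
/* FUTEX_WAKE wakes up at most one process blocked on addr. */
static void futex_wake(volatile int *addr) {
  syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

int main(void) {
  struct timeval start, end;
  long usec;
  int i;
  /* One shared futex word, visible to both parent and child. */
  volatile int *ftx = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  *ftx = 0;

  gettimeofday(&start, NULL);
  if (fork() == 0) {                          /* child */
    for (i = 0; i < ITERATIONS; i++) {
      while (*ftx != 1) futex_wait(ftx, 0);   /* sleep until the parent hands over */
      *ftx = 0;
      futex_wake(ftx);                        /* hand control back */
    }
    _exit(0);
  }
  for (i = 0; i < ITERATIONS; i++) {          /* parent */
    *ftx = 1;
    futex_wake(ftx);                          /* wake the child... */
    while (*ftx != 0) futex_wait(ftx, 1);     /* ...and sleep until it yields */
  }
  wait(NULL);
  gettimeofday(&end, NULL);

  usec = (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
  /* Each iteration forces two switches: parent->child and child->parent. */
  printf("%.0f ns/context switch\n", usec * 1000.0 / (2.0 * ITERATIONS));
  return 0;
}

The while loops around futex_wait matter: FUTEX_WAIT returns immediately if the word no longer holds the expected value, which is what makes the hand-off race-free even if one side gets ahead of the other.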

Source code: timectxsw.c Results:
  • Intel 5150: ~4300ns/context switch
  • Intel E5440: ~3600ns/context switch
  • Intel E5520: ~4500ns/context switch
  • Intel X5550: ~3000ns/context switch
  • Intel L5630: ~3000ns/context switch
  • Intel E5-2620: ~3000ns/context switch
Note: those results include the overhead of the futex system calls.

Now you must take those results with a grain of salt. The micro-benchmark does nothing but context switching. In practice context switching is expensive because it screws up the CPU caches (L1, L2, L3 if you have one, and the TLB – don't forget the TLB!).

CPU affinity

Things are harder to predict in an SMP environment, because the performance can vary wildly depending on whether a task is migrated from one core to another (especially if the migration is across physical CPUs). I ran the benchmarks again, but this time I pinned the processes/threads to a single core (or "hardware thread"). The performance speedup is dramatic.
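cpubench.sh (linked below) simply re-runs the previous benchmarks under taskset; in C the pinning itself boils down to one sched_setaffinity call. A minimal sketch, with core 0 as an arbitrary choice:

/* Pin the calling process to CPU 0 before benchmarking
 * (roughly what running the benchmark under `taskset -c 0` achieves).
 * Children forked afterwards inherit the affinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
  cpu_set_t set;

  CPU_ZERO(&set);
  CPU_SET(0, &set);                              /* core (or hardware thread) #0 */
  if (sched_setaffinity(0, sizeof(set), &set) != 0) {
    perror("sched_setaffinity");
    return 1;
  }
  /* ... run the ping-pong here: both tasks now compete for the same core ... */
  return 0;
}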

Source code: cpubench.sh Results:
  • Intel 5150: ~1900ns/process context switch, ~1700ns/thread context switch
  • Intel E5440: ~1300ns/process context switch, ~1100ns/thread context switch
  • Intel E5520: ~1400ns/process context switch, ~1300ns/thread context switch
  • Intel X5550: ~1300ns/process context switch, ~1100ns/thread context switch
  • Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
  • Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context switch
Performance boost: 5150: 66%, E5440: 65-70%, E5520: 50-54%, X5550: 55%, L5630: 45%, E5-2620: 45%.

The performance gap between thread switches and process switches seems to increase with newer CPU generations (5150: 7-8%, E5440: 5-15%, E5520: 11-20%, X5550: 15%, L5630: 13%, E5-2620: 19%). Overall the penalty of switching from one task to another remains very high. Bear in mind that those artificial tests do absolutely zero computation, so they probably have 100% cache hit in L1d and L1i. In the real world, switching between two tasks (threads or processes) typically incurs significantly higher penalties due to cache pollution. But we'll get back to this later.

Threads vs. processes

After producing the numbers above, I quickly criticized Java applications, because it's fairly common to create shitloads of threads in Java, and the cost of context switching becomes high in such applications. Someone retorted that, yes, Java uses lots of threads, but threads have become significantly faster and cheaper with the NPTL in Linux 2.6. They said that normally there's no need to do a TLB flush when switching between two threads of the same process. That's true: you can go check the source code of the Linux kernel (switch_mm in mmu_context.h):
static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
       unsigned cpu = smp_processor_id();

       if (likely(prev != next)) {
               [...]
               load_cr3(next->pgd);
       } else {
               [don't typically reload cr3]
       }
}
In this code, the kernel expects to be switching between tasks that have different memory structures, in which case it updates CR3, the register that holds a pointer to the page table. Writing to CR3 automatically causes a TLB flush on x86.

In practice though, with the default kernel scheduler and a busy server-type workload, it's fairly infrequent to go through the code path that skips the call to load_cr3. Plus, different threads tend to have different working sets, so even if you skip this step, you still end up polluting the L1/L2/L3/TLB caches. I re-ran the benchmark above with 2 threads instead of 2 processes (source: timetctxsw.c), but the results aren't significantly different (this varies a lot depending on scheduling and luck, but averaged over many runs it's typically only 100ns faster to switch between threads if you don't set a custom CPU affinity).
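For reference, the thread flavor of the benchmark doesn't need fork or mmap at all: since both threads share the whole address space, a plain global int can serve as the futex word. A simplified sketch (my own reconstruction, not the original timetctxsw.c):

/* Thread version of the futex ping-pong (not the original timetctxsw.c).
 * Build with -pthread. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

#define ITERATIONS 500000

static volatile int ftx;   /* shared by both threads without any mmap */

static void futex_wait(volatile int *addr, int val) {
  syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}
static void futex_wake(volatile int *addr) {
  syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void *partner(void *arg) {
  int i;
  for (i = 0; i < ITERATIONS; i++) {
    while (ftx != 1) futex_wait(&ftx, 0);   /* wait for the main thread's turn marker */
    ftx = 0;
    futex_wake(&ftx);                       /* hand control back */
  }
  return arg;
}

int main(void) {
  struct timeval start, end;
  pthread_t tid;
  long usec;
  int i;

  gettimeofday(&start, NULL);
  pthread_create(&tid, NULL, partner, NULL);
  for (i = 0; i < ITERATIONS; i++) {
    ftx = 1;
    futex_wake(&ftx);
    while (ftx != 0) futex_wait(&ftx, 1);
  }
  pthread_join(tid, NULL);
  gettimeofday(&end, NULL);

  usec = (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
  printf("%.0f ns/context switch\n", usec * 1000.0 / (2.0 * ITERATIONS));
  return 0;
}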

Indirect costs in context switches: cache pollution

The results above are in line with a paper published by a bunch of guys from the University of Rochester: Quantifying The Cost of Context Switch. On an unspecified Intel Xeon (the paper was written in 2007, so the CPU was probably not too old), they end up with an average time of 3800ns. They use another method I had thought of, which involves writing/reading 1 byte to/from a pipe to block/unblock a couple of processes. I thought that (ab)using futex would be better, since futex essentially exposes some scheduling interface to userland.
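For comparison, the pipe-based method the paper describes looks roughly like this (a reconstruction of the idea, not their code; the iteration count is arbitrary):

/* Sketch of the pipe-based ping-pong described in the Rochester paper. */
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

#define ITERATIONS 500000

int main(void) {
  int p2c[2], c2p[2];                /* parent->child and child->parent pipes */
  struct timeval start, end;
  char buf = 0;
  long usec;
  int i;

  pipe(p2c);
  pipe(c2p);

  if (fork() == 0) {                 /* child: echo every byte back */
    for (i = 0; i < ITERATIONS; i++) {
      read(p2c[0], &buf, 1);         /* blocks until the parent writes */
      write(c2p[1], &buf, 1);
    }
    _exit(0);
  }

  gettimeofday(&start, NULL);
  for (i = 0; i < ITERATIONS; i++) {
    write(p2c[1], &buf, 1);          /* unblock the child... */
    read(c2p[0], &buf, 1);           /* ...and block until it answers */
  }
  gettimeofday(&end, NULL);
  wait(NULL);

  usec = (end.tv_sec - start.tv_sec) * 1000000L + (end.tv_usec - start.tv_usec);
  /* Each round trip is two context switches, plus the pipe syscall overhead. */
  printf("%.0f ns/context switch\n", usec * 1000.0 / (2.0 * ITERATIONS));
  return 0;
}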

The paper goes on to explain the indirect costs involved in context switching, which are due to cache interference. Beyond a certain working set size (about half the size of the L2 cache in their benchmarks), the cost of context switching increases dramatically (by 2 orders of magnitude).

I think this is a more realistic expectation. Not sharing data between threads leads to optimal performance, but it also means that every thread has its own working set and that when a thread is migrated from one core to another (or worse, across physical CPUs), the cache pollution is going to be costly. Unfortunately, when an application has many more active threads than hardware threads, this happens all the time. That's why not creating more active threads than there are hardware threads available is so important, because in this case it's easier for the Linux scheduler to keep re-scheduling the same threads on the core they last used ("weak affinity").

Having said that, these days, our CPUs have much larger caches, and can even have an L3 cache.
  • 5150: L1i & L1d = 32K each, L2 = 4M
  • E5440: L1i & L1d = 32K each, L2 = 6M
  • E5520: L1i & L1d = 32K each, L2 = 256K/core, L3 = 8M (same for the X5550)
  • L5630: L1i & L1d = 32K each, L2 = 256K/core, L3 = 12M
  • E5-2620: L1i & L1d = 32K each, L2 = 256K/core, L3 = 15M
Note that in the case of the E5520/X5550/L5630 (the ones marketed as "i7") as well as the Sandy Bridge E5-2620, the L2 cache is tiny but there's one L2 cache per core (with HT enabled, this gives us 128K per hardware thread). The L3 cache is shared by all the cores on each physical CPU.

Having more cores is great, but it also increases the chance that your task gets rescheduled onto a different core. The cores then have to "migrate" cache lines around, which is expensive. I recommend reading What Every Programmer Should Know About Memory by Ulrich Drepper (yes, him again!) to understand more about how this works and the performance penalties involved.

So how does the cost of context switching increase with the size of the working set? This time we'll use another micro-benchmark, timectxswws.c, which takes as an argument the number of pages to use as a working set. This benchmark is exactly the same as the one used earlier to test the cost of context switching between two processes, except that now each process does a memset on the working set, which is shared across both processes. Before starting, the benchmark times how long it takes to write over all the pages in the working set size requested. This time is then discounted from the total time taken by the test. This attempts to estimate the overhead of overwriting pages across context switches.
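In outline, the working-set variant adds a memset on the shared pages to each turn of the ping-pong and calibrates the cost of those writes beforehand so it can be subtracted. A simplified sketch of the idea (my approximation of what timectxswws.c does, not the original; the calibration pass and all the names are assumptions):

/* Sketch of the working-set variant of the futex ping-pong. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define ITERATIONS 100000
#define PAGE_SIZE 4096

static void futex_wait(volatile int *a, int v) { syscall(SYS_futex, a, FUTEX_WAIT, v, NULL, NULL, 0); }
static void futex_wake(volatile int *a)        { syscall(SYS_futex, a, FUTEX_WAKE, 1, NULL, NULL, 0); }

static long elapsed_us(struct timeval a, struct timeval b) {
  return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec);
}

int main(int argc, char **argv) {
  int npages = argc > 1 ? atoi(argv[1]) : 8;
  size_t ws = (size_t)npages * PAGE_SIZE;
  struct timeval t0, t1;
  long memset_us, total_us;
  int i;

  /* Shared mapping: one page for the futex word, then the working set.
   * MAP_ANONYMOUS memory starts zeroed, so the futex word begins at 0. */
  char *mem = mmap(NULL, PAGE_SIZE + ws, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  volatile int *ftx = (volatile int *)mem;
  char *wset = mem + PAGE_SIZE;

  /* Calibration: cost of writing the working set without any switching.
   * This is later discounted from the ping-pong timings. */
  gettimeofday(&t0, NULL);
  for (i = 0; i < ITERATIONS; i++) memset(wset, i, ws);
  gettimeofday(&t1, NULL);
  memset_us = elapsed_us(t0, t1);

  gettimeofday(&t0, NULL);
  if (fork() == 0) {                           /* child */
    for (i = 0; i < ITERATIONS; i++) {
      while (*ftx != 1) futex_wait(ftx, 0);
      memset(wset, i, ws);                     /* pollute the caches on every turn */
      *ftx = 0;
      futex_wake(ftx);
    }
    _exit(0);
  }
  for (i = 0; i < ITERATIONS; i++) {           /* parent */
    memset(wset, i, ws);
    *ftx = 1;
    futex_wake(ftx);
    while (*ftx != 0) futex_wait(ftx, 1);
  }
  wait(NULL);
  gettimeofday(&t1, NULL);

  /* Both tasks wrote the working set once per iteration: discount 2x the calibration. */
  total_us = elapsed_us(t0, t1) - 2 * memset_us;
  printf("%d pages: ~%.0f ns/context switch\n",
         npages, total_us * 1000.0 / (2.0 * ITERATIONS));
  return 0;
}

Note that the calibration pass runs with warm caches, so what remains after the subtraction is the direct switch cost plus the cache/TLB pollution discussed above.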

Here are the results for the 5150:

As we can see, the time needed to write a 4K page more than doubles once our working set is bigger than what we can fit in the L1d (32K). The time per context switch keeps going up as the working set size increases, but beyond a certain point the benchmark becomes dominated by memory accesses and is no longer actually testing the overhead of a context switch: it's simply testing the performance of the memory subsystem.

Same test, but this time with CPU affinity (both processes pinned on the same core):

Oh wow, watch this! It's an order of magnitude faster when pinning both processes on the same core! Because the working set is shared, it fits entirely in the 4M L2 cache, and cache lines simply need to be transferred from L2 to L1d instead of being transferred from core to core (potentially across 2 physical CPUs, which is far more expensive than staying within the same CPU).

Now the results for the i7 processor:

Note that this time I covered larger working set sizes, hence the log scale on the X axis.

So yes, context switching on i7 is faster, but only up to a point. Real applications (especially Java applications) tend to have large working sets, so they typically pay the highest price when undergoing a context switch. Other observations about the Nehalem architecture used in the i7:
  • Going from L1 to L2 is almost unnoticeable. It takes about 130ns to write a page with a working set that fits in L1d (32K) and only 180ns when it fits in L2 (256K). In this respect, the L2 on Nehalem is more of an "L1.5", since its latency is simply not comparable to that of the L2 of previous CPU generations.
  • As soon as the working set increases beyond 1024K, the time needed to write a page jumps to 750ns. My theory here is that 1024K = 256 pages = half of the TLB of the core, which is shared by the two HyperThreads. Because now both HyperThreads are fighting for TLB entries, the CPU core is constantly doing page table lookups.
Speaking of the TLB, the Nehalem has an interesting architecture. Each core has a 64-entry "L1d TLB" (there's no "L1i TLB") and a unified 512-entry "L2 TLB". Both are dynamically allocated between both HyperThreads.

Virtualization

I was wondering how much overhead there is when using virtualization. I repeated the benchmarks for the dual E5440, once in a normal Linux install and once while running the same install inside VMware ESX Server. The result is that, on average, it's 2.5x to 3x more expensive to do a context switch when using virtualization. My guess is that this is because the guest OS can't update the page table itself, so when it attempts to change it, the hypervisor intervenes, which causes an extra two context switches (one to get into the hypervisor, one to get back out to the guest OS).

This probably explains why Intel added the EPT (Extended Page Table) on the Nehalem, since it enables the guest OS to modify its own page table without the help of the hypervisor, and the CPU is able to do the end-to-end memory address translation on its own, entirely in hardware (virtual address to "guest-physical" address to physical address).

Parting words

Context switching is expensive. My rule of thumb is that it'll cost you about 30μs of CPU overhead. This seems to be a good worst-case approximation. Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads. I think the sweet spot for optimal CPU use is to have the same number of worker threads as there are hardware threads, and write code in an asynchronous / non-blocking fashion. Asynchronous code tends to be CPU bound, because anything that would block is simply deferred to later, until the blocking operation completes. This means that threads in asynchronous / non-blocking applications are much more likely to use their full time quantum before the kernel scheduler preempts them. And if there's the same number of runnable threads as there are hardware threads, the kernel is very likely to reschedule threads on the same core, which significantly helps performance.

Another hidden cost that severely impacts server-type workloads is that after being switched out, even if your process becomes runnable, it'll have to wait in the kernel's run queue until a CPU core is available for it. Linux kernels are often compiled with HZ=100, which means processes are given time slices of 10ms. If your thread has been switched out but becomes runnable almost immediately, and there are 2 other threads ahead of it in the run queue waiting for CPU time, your thread may have to wait up to 20ms in the worst case before it gets CPU time. So depending on the average length of the run queue (which is reflected in load average) and how long your threads typically run before getting switched out again, this can considerably impact performance.

It is illusory to imagine that NPTL or the Nehalem architecture made context switching cheaper in real-world server-type workloads. Default Linux kernels don't do a good job of maintaining CPU affinity, even on idle machines. You must explore alternative schedulers or use taskset or cpuset to control affinity yourself. If you're running multiple different CPU-intensive applications on the same server, manually partitioning cores across applications can help you achieve very significant performance gains.