Hardware Recommendations
Ceph was designed to run on commodity hardware, which makes building and maintaining petabyte-scale data clusters economically feasible. When planning out your cluster hardware, you will need to balance a number of considerations, including failure domains and potential performance issues. Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts. Generally, we recommend running Ceph daemons of a specific type on a host configured for that type of daemon. We recommend using other hosts for processes that utilize your data cluster (e.g., OpenStack, CloudStack, etc.).
Tip
Check out the Ceph blog too. Articles like Ceph Write Throughput 1, Ceph Write Throughput 2, Argonaut v. Bobtail Performance Preview, Bobtail Performance - I/O Scheduler Comparison and others are an excellent source of information.
CPU
Ceph metadata servers dynamically redistribute their load, which is CPU intensive, so your metadata servers should have significant processing power (e.g., quad core or better CPUs). Ceph OSDs run the RADOS service, calculate data placement with CRUSH, replicate data, and maintain their own copy of the cluster map. Therefore, OSDs should have a reasonable amount of processing power (e.g., dual core processors). Monitors simply maintain a master copy of the cluster map, so they are not CPU intensive. You must also consider whether the host machine will run CPU-intensive processes in addition to Ceph daemons. For example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure that these other processes leave sufficient processing power for Ceph daemons. We recommend running additional CPU-intensive processes on separate hosts.
RAM
Metadata servers and monitors must be capable of serving their data quickly, so they should have plenty of RAM (e.g., 1GB of RAM per daemon instance). OSDs do not require as much RAM for regular operations (e.g., 500MB of RAM per daemon instance); however, during recovery they need significantly more RAM (e.g., ~1GB per 1TB of storage per daemon). Generally, more RAM is better.
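As a rough illustration of the rules of thumb above, the following Python sketch estimates RAM for an OSD host during regular operation and during recovery. The daemon count and per-OSD storage size are hypothetical examples, not recommendations:

    # Rough RAM sizing for an OSD host, based on the rules of thumb above:
    # ~500MB per OSD daemon for regular operation, and ~1GB per 1TB of
    # storage per daemon during recovery. The host layout is hypothetical.

    osds_per_host = 6      # hypothetical number of OSD daemons on the host
    tb_per_osd = 2         # hypothetical storage (TB) behind each OSD

    regular_ram_gb = osds_per_host * 0.5                  # ~500MB per daemon
    recovery_ram_gb = osds_per_host * tb_per_osd * 1.0    # ~1GB per 1TB per daemon

    print(f"Regular operation: ~{regular_ram_gb:.1f} GB RAM")
    print(f"During recovery:   ~{recovery_ram_gb:.1f} GB RAM")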
Data Storage
Plan your data storage configuration carefully. There are significant cost and performance tradeoffs to consider when planning for data storage. Simultaneous OS operations and simultaneous requests for read and write operations from multiple daemons against a single drive can slow performance considerably. There are also file system limitations to consider: btrfs is not quite stable enough for production, but it has the ability to journal and write data simultaneously, whereas XFS and ext4 do not.
Important
Since Ceph has to write all data to the journal before it can send an ACK (for XFS and ext4 at least), having the journal and OSD performance in balance is really important!
Hard Disk Drives
OSDs should have plenty of hard disk drive space for object data. We recommend a minimum hard disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage of larger disks. We recommend dividing the price of the hard disk drive by the number of gigabytes to arrive at a cost per gigabyte, because larger drives may have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte hard disk priced at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the 1 terabyte disks would generally increase the cost per gigabyte by 40%, rendering your cluster substantially less cost efficient. Also, the larger the storage drive capacity, the more memory per OSD daemon you will need, especially during recovery.
Solid State Drives
One opportunity for performance improvement is to use solid-state drives (SSDs) to reduce random access time and read latency while accelerating throughput. SSDs often cost more than 10x as much per gigabyte when compared to a hard disk drive, but SSDs often exhibit access times that are at least 100x faster than a hard disk drive.
SSDs do not have moving mechanical parts, so they aren’t necessarily subject to the same types of limitations as hard disk drives. SSDs do have significant limitations though. When evaluating SSDs, it is important to consider the performance of sequential reads and writes. An SSD that has 400MB/s sequential write throughput may have much better performance than an SSD with 120MB/s of sequential write throughput when storing multiple journals for multiple OSDs.
Important
We recommend exploring the use of SSDs to improve performance. However, before making a significant investment in SSDs, we strongly recommend both reviewing the performance metrics of an SSD and testing the SSD in a test configuration to gauge performance.
Since SSDs have no moving mechanical parts, it makes sense to use them in the areas of Ceph that do not use a lot of storage space (e.g., journals). Relatively inexpensive SSDs may appeal to your sense of economy. Use caution. Acceptable IOPS are not enough when selecting an SSD for use with Ceph. There are a few important performance considerations for journals and SSDs:
Write-intensive semantics: Journaling involves write-intensive semantics, so you should ensure that the SSD you choose to deploy will perform equal to or better than a hard disk drive when writing data. Inexpensive SSDs may introduce write latency even as they accelerate access time, because sometimes high-performance hard drives can write as fast or faster than some of the more economical SSDs available on the market!
Sequential Writes: When you store multiple journals on an SSD, you must consider the sequential write limitations of the SSD too, since it may be handling requests to write to multiple OSD journals simultaneously (see the sizing sketch after this list).
Partition Alignment: A common problem with SSD performance is that people like to partition drives as a best practice, but they often overlook proper partition alignment with SSDs, which can cause SSDs to transfer data much more slowly. Ensure that SSD partitions are properly aligned.
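To make the sequential-write consideration concrete, here is a small Python sketch that compares an SSD's sequential write throughput against the combined journal traffic of several OSDs sharing it. The journal count and per-OSD write rate are hypothetical figures for illustration, not measured values:

    # Check whether a single journal SSD can absorb the combined sequential
    # write traffic of several OSD journals. All figures are illustrative.

    ssd_seq_write_mb_s = 400     # sequential write throughput of the SSD
    journals_on_ssd = 4          # hypothetical number of OSD journals on the SSD
    per_osd_write_mb_s = 110     # hypothetical sustained write rate per OSD disk

    required_mb_s = journals_on_ssd * per_osd_write_mb_s
    if required_mb_s > ssd_seq_write_mb_s:
        print(f"Bottleneck: journals need ~{required_mb_s} MB/s, "
              f"SSD provides {ssd_seq_write_mb_s} MB/s")
    else:
        print(f"OK: journals need ~{required_mb_s} MB/s, "
              f"SSD provides {ssd_seq_write_mb_s} MB/s")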
While SSDs are cost prohibitive for object storage, OSDs may see a significant performance improvement by storing an OSD’s journal on an SSD and the OSD’s object data on a separate hard disk drive. The osd journal configuration setting defaults to /var/lib/ceph/osd/$cluster-$id/journal. You can mount this path to an SSD or to an SSD partition so that it is not merely a file on the same disk as the object data.
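As a quick sanity check after mounting, the following Python sketch (standard library only) verifies that the journal path lives on a different block device than the OSD data directory. The cluster name and OSD id below are hypothetical examples:

    import os

    # Verify that the journal path and the OSD data directory sit on
    # different block devices (i.e., the journal is not just a file on
    # the same disk as the object data). Cluster/OSD id are hypothetical.
    cluster, osd_id = "ceph", "0"
    data_dir = f"/var/lib/ceph/osd/{cluster}-{osd_id}"
    journal = os.path.join(data_dir, "journal")

    data_dev = os.stat(data_dir).st_dev
    journal_dev = os.stat(journal).st_dev   # follows a symlink or mount, if any

    if data_dev == journal_dev:
        print("Journal is on the same device as the object data.")
    else:
        print("Journal is on a separate device (e.g., an SSD partition).")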
One way Ceph accelerates CephFS file system performance is to segregate the storage of CephFS metadata from the storage of the CephFS file contents. Ceph provides a default metadata pool for CephFS metadata. You will never have to create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for your CephFS metadata pool that points only to a host’s SSD storage media. See Mapping Pools to Different Types of OSDs for details.
Controllers
Disk controllers also have a significant impact on write throughput. Carefully consider your selection of disk controllers to ensure that they do not create a performance bottleneck.
Tip
The Ceph blog is often an excellent source of information on Ceph performance issues. See Ceph Write Throughput 1 and Ceph Write Throughput 2 for additional details.
Additional Considerations
You may run multiple OSDs per host, but you should ensure that the sum of the total throughput of your OSD hard disks doesn’t exceed the network bandwidth required to service a client’s need to read or write data. You should also consider what percentage of the overall data the cluster stores on each host. If the percentage on a particular host is large and the host fails, it can lead to problems such as exceeding the full ratio, which causes Ceph to halt operations as a safety precaution to prevent data loss.
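A back-of-the-envelope way to check the first point is to compare a host's aggregate disk throughput against its network bandwidth, as in this Python sketch. All of the figures are hypothetical examples:

    # Compare the aggregate throughput of a host's OSD disks against its
    # network bandwidth. All figures below are hypothetical examples.

    osds_per_host = 8
    disk_throughput_mb_s = 110      # sustained throughput of each OSD disk
    nic_bandwidth_gbit_s = 10       # host network link (e.g., 10GbE)

    aggregate_disk_mb_s = osds_per_host * disk_throughput_mb_s
    nic_mb_s = nic_bandwidth_gbit_s * 1000 / 8   # Gbit/s -> MB/s (approx.)

    print(f"Aggregate disk throughput: ~{aggregate_disk_mb_s} MB/s")
    print(f"Network bandwidth:         ~{nic_mb_s:.0f} MB/s")
    if aggregate_disk_mb_s > nic_mb_s:
        print("The network link, not the disks, will be the bottleneck.")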
When you run multiple OSDs per host, you also need to ensure that the kernel is up to date. See OS Recommendations for notes on glibc and syncfs(2) to ensure that your hardware performs as expected when running multiple OSDs.