Looking at the world in general, the main disruptors for the medium term future would seem to be: an aging population, especially in the West; the Internet of Things or IoT; advanced robotics; a focus on privacy, at both a personal and a regulatory level; and Disaster Recovery and Business Continuity, including combating ransomware. Our challenge is to investigate these disruptors and try to figure out what they will mean for us in terms of future directions of storage.
The amount of stored data continues to grow at an alarming rate, at least in the Open systems world. IoT should certainly fuel future growth, but savvy people are now a little worried about the size of the 'digital footprint' that they are leaving behind them, and the wealth of detail that it contains about them, so this could slow data growth in some areas. However, storage futures continue to be driven by data growth, cost containment and regulatory compliance. How do you plan for this when there is a bewildering array of products and services to choose from, all of which promise to fix all your issues with a minimum of effort?
These future possibilities can be split up into three areas:
The biggest upcoming change for capacity provision will be storage automation. An authorised user will be able to request and obtain (and pay for) capacity on demand, with automation doing all the provisioning in the background. The requestor will select the storage service they need based on some simple criteria, like capacity, resilience, cost and performance, with the data management, mirroring and backups happening in the background. This could well happen in a local data center, and is already happening for applications based in the Cloud.
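As a rough illustration of the request step described above, the sketch below picks the cheapest service from a catalogue that meets a requestor's criteria. The catalogue entries, field names and prices are all invented for the example and do not describe any real product or provisioning API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageService:
    """One entry in a hypothetical self-service storage catalogue."""
    name: str
    max_capacity_tb: float     # largest single request the service accepts
    mirrored: bool             # resilience: is the data mirrored?
    latency_ms: float          # typical response time
    cost_per_tb_month: float   # illustrative price, not a real tariff


# Invented catalogue, for illustration only.
CATALOGUE = [
    StorageService("archive", 500.0, False, 50.0, 5.0),
    StorageService("standard", 200.0, True, 5.0, 20.0),
    StorageService("premium-flash", 50.0, True, 0.5, 80.0),
]


def choose_service(capacity_tb: float, need_mirror: bool,
                   max_latency_ms: float) -> Optional[StorageService]:
    """Return the cheapest service that satisfies every criterion, or None."""
    candidates = [
        s for s in CATALOGUE
        if s.max_capacity_tb >= capacity_tb
        and (s.mirrored or not need_mirror)
        and s.latency_ms <= max_latency_ms
    ]
    return min(candidates, key=lambda s: s.cost_per_tb_month, default=None)


def monthly_cost(service: StorageService, capacity_tb: float) -> float:
    """Charge the requestor for the capacity actually requested."""
    return service.cost_per_tb_month * capacity_tb
```

In a real deployment the chosen service would then trigger the provisioning, mirroring and backup workflows in the background; those are left out here.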
If you decide to purchase capacity yourself, then you need to consider the trade-off between capacity, performance and cost. High speed storage usually comes in smaller increments and is more expensive. The trick is to make sure that the data that needs best performance is on that high speed storage, while older data can be on cheaper media. The future of primary storage is flash disk connected by NVMe.
Flash Storage continues to be the favourite for fast access, and is rapidly replacing hard disk drives. Every vendor supplies all-Flash arrays now, and most will supply hybrid mixed disk and flash systems, but there are very few storage systems now that only host hard disk drives. The price of NAND flash is expected to fall in 2019 due to oversupply in the market, possibly by 30% or more. 3D NAND is also pushing prices down, with 64 layer expected to dominate in 2019 while 96 layer will slowly enter the market too. With multi-terabyte Flash drives already in production, the disk drive future looks very uncertain. However it looks like Flash storage itself is coming to the end of its development capabilities, and new technologies are planned to supplement it.
Intel's Optane uses phase-change memory, or PRAM, which at a very simplistic level works by changing the state of chalcogenide glass, as the two states have different electrical resistance. Optane drives are available in 2019, but sizes are small at 280 or 480GB. They are currently used as a high speed cache between the processor and storage, Flash or HDD. They have some way to go to supplant Flash.
Optane itself has some developing competition from Crossbar, which has designed a ReRAM device which it claims is 1,000x faster than NAND flash, has 1/20th the power consumption, and lasts 1,000x longer. Optane is available now on PCs, as a cache between hard drive storage and main memory, and is getting good reviews. We can expect it to be more prevalent in 2019, and hopefully the cost will come down too.
There are plans to store data on DNA strands, which would be at molecular level. This promises very high capacity density, but if it is ever to become a mainstream storage product, then a lot more development is needed. CATALOG Technologies is developing an implementation that uses standard DNA building blocks, or pre-made DNA molecules, which they say is a faster and cheaper way of building a datablock than assembling the molecules individually as required. One to watch, but unlikely to surface until the 2020s.
One of the problems with producing faster technology is that the storage devices can process data faster than the existing comms channels that serve them, so the comms channels become a bottleneck. I'm certainly hearing this from Mainframe performance experts, who tell me that Flash drives have resolved the problems with disk performance, but the FICON channels are now the problem. There are two angles to consider here: the interfaces that exist within a server or device, and the external connections and channels.
PCI Express, or PCIe, provides the fast internal interface, and NVMe is an alternative to the old SCSI protocol for transferring data between hosts and peripheral storage. NVMe was designed specifically to support the high IO rates demanded by PCIe connected Flash Storage and it is generally accepted that NVMe SSDs will eventually supersede SATA SSDs.
There are physical limits to how fast you can send a signal down a wire, but what you can do is use more wires in parallel. The M.2 PCIe device interface allows a device to connect to 2 or 4 PCIe lanes and as this scales up, it should cope with the fastest Flash Storage transfer rates, but this is a motherboard connection. It should resolve the comms issues inside the storage device, but will not help with the external channels.
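The effect of adding lanes is simple arithmetic. The sketch below assumes PCIe 3.0 signalling figures (8 GT/s per lane with 128b/130b encoding), which are not stated in the text above, and it ignores protocol overhead, so the results should be read as upper bounds rather than measured transfer rates.

```python
# Assumed PCIe 3.0 signalling figures; other PCIe generations differ.
TRANSFERS_PER_SEC = 8e9            # 8 GT/s per lane
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding


def lane_bandwidth_mb_s() -> float:
    """Usable bandwidth of one lane in MB/s (1 MB = 10**6 bytes)."""
    bits_per_sec = TRANSFERS_PER_SEC * ENCODING_EFFICIENCY
    return bits_per_sec / 8 / 1e6


def link_bandwidth_mb_s(lanes: int) -> float:
    """Bandwidth scales linearly with the number of lanes."""
    return lanes * lane_bandwidth_mb_s()


for lanes in (2, 4):
    print(f"x{lanes}: about {link_bandwidth_mb_s(lanes):.0f} MB/s")
```

Under these assumptions a single lane carries a little under 1 GB/s, so a four lane connection has roughly twice the headroom of a two lane one.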
The next stage is NVMe over Fabrics (NVMe-oF), which is intended to improve the data transfer between host computers and target storage systems. At the moment, NVMe-oF just provides remote direct memory access (RDMA) over converged Ethernet (RoCE) and Fibre Channel (NVMe-FC). In 2019, several all-flash array vendors such as Kaminario, Pure Storage and Western Digital's Tegile have moved to use NVMe-oF as a back-end fabric. The expectation is that NVMe will expand further in 2019, making inroads into storage systems, servers, and SAN fabrics. Initially it is expected that NVMe-oF will just be used on a small scale, connecting components in maybe a couple of racks, but eventually it will probably supplant SCSI as the connectivity protocol of choice.
Magnetic Disk is not completely dead; two technology improvements exist which could extend its life a bit. Heat assisted magnetic recording (HAMR) and Microwave assisted magnetic recording (MAMR) both use techniques to persuade the magnetic domains to change polarity faster, by using a laser (HAMR) or a microwave generator (MAMR) in the write head. This allows the data density on a disk to improve by a factor of two or four, so reducing the cost per terabyte.
Magnetic tape does seem to be falling behind, with the current LTO-8 tape holding just 12TB raw storage capacity, though there is a roadmap to LTO-10 with a 48TB capacity. Magnetic tape is still the best product for cold archive data, as the slower access is offset by the cheaper price per TB. Tape backups are reliable and inexpensive, and once the data is written to tape, it cannot be altered in situ, making it the best solution for recovery from ransomware attacks.
The Cloud can be defined as a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer. A lot of smaller businesses are now using the cloud to reduce their storage costs, often using SaaS applications. A lot of the pressure to move to the Cloud comes from CEOs or other senior management, rather than being driven by the IT department. One reason for this pressure is that as data moves to the Cloud, the cost moves from CapEx to OpEx, which is always preferred by accountants.
As the business usage of the Cloud matures, consistent management of data between Clouds and data movement between Clouds will become critical. Most companies now use more than one Cloud provider, but it is important to make sure that specific applications are held on the most cost-effective Cloud platform.
Many companies have on-premises private clouds, but your biggest challenge might be getting your data back out of a public Cloud if you need to. If you use a Cloud provider, find out what options you have to extract your data. What you need is a storage management tool that can transparently move data from on-premises configurations to public clouds and across private cloud deployments. You can then benefit from the performance advantages of a private cloud, but also the savings public clouds drive for backup and archival data.
The other requirement is policy-based data management with a common set of rules for data retention, protection and access control over different Clouds. However the problem with implementing this is that different Cloud providers use different semantics and formats for their cloud object stores.
Enter multi-cloud data management products like Vizion.ai, Swiftstack, Scality Zenko, NooBaa, Rubrik Polaris GPS and Cohesity Helios. These products are intended to fix the issues above. They provide a mixture of: a single namespace for the stored files or objects, search and analysis facilities, global policy management and migration facilities. You would need to check out individual products to see what each one delivers.
One thing multi-cloud data management tools could do better is to make performance recommendations, and recommendations for tiering and usage levels across cloud providers. That analytical element may be the next step in multi-cloud data management.
Getting data in and out of the cloud can take some time: seconds for small amounts of data, and hours for Big Data. This is beginning to become a problem, especially for the Internet of Things. Enter Edge computing and Fog computing. You can wait a second or two for a response from Amazon Echo, but we want a device like a driverless car to respond instantly. For this to happen you need processing power and storage on the device itself, or at the 'edge'.
If you consider the internet to be a bit like a spider's web, then the Cloud would be the computing and storage 'spider' in the center, and the web extends out to all the connected things. The idea behind Edge computing is that storage and processing is provided at the 'edge' of the web, to reduce the amount of raw data that would need to be passed over the web, and speed up processing for the things. Now pardon me for being a cynic, but is that not just the way things used to work before the Cloud came along? Could it be that the cloud was over hyped, and cannot provide fast enough response times for many applications? But, rather than admit they got things wrong, the vendors and planners have to come up with a new term for computing outside the Cloud, so let's call it the Edge.
Fog computing provides the same functionality, but could be a little nearer the users than the Edge, or it could be the same as the Edge. As yet there is no agreed definition for this. Edge devices can be either small, low-cost cluster hardware in an SME, or server farms with clustering and large scale storage networks in a very large corporation.
Edge processing is expected to grow in 2019, which will mean that companies must provision and manage data storage for it. If your cloud, public or private, spans multiple cities or countries, then this could be a challenge.
You could almost define Blockchain Storage as the Cloud on steroids. Imagine a scenario where your data is stored on dozens of individual storage units around the globe, accessible via the internet, but with no central control point. The starting point for understanding this is a blockchain.
A Blockchain is a distributed ledger or database that records transactions between two or more parties and maintains details about each transaction, where each transaction is added to the ledger in chronological order. The data is stored as a series of blocks and each block references the preceding block to form an interconnected chain. This looks like a SPOF at first sight, but the ledger is distributed across multiple nodes, with each node maintaining a complete copy. As every node has a copy of the ledger, and has full access to add blocks and verify blocks, there is no need for a central authority or third-party verification service.
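The way each block references the preceding one can be shown with a minimal hash chain. This is a simplified sketch on a single machine: it has no distribution across nodes and no consensus mechanism, and the function names and the use of SHA-256 are choices made for the example.

```python
import hashlib
import json


def block_hash(index: int, data: str, prev_hash: str) -> str:
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain: list, data: str) -> None:
    """Append a block that references the hash of the preceding block."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    chain.append({"index": index, "data": data, "prev": prev_hash,
                  "hash": block_hash(index, data, prev_hash)})


def is_valid(chain: list) -> bool:
    """Recompute every hash and check each link to the preceding block."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else "0" * 64
        if block["prev"] != expected_prev:
            return False
        if block["hash"] != block_hash(block["index"], block["data"],
                                       block["prev"]):
            return False
    return True


ledger = []
add_block(ledger, "A pays B 5")
add_block(ledger, "B pays C 2")
print(is_valid(ledger))          # True for the untouched chain
ledger[0]["data"] = "A pays B 500"
print(is_valid(ledger))          # False once an earlier block is altered
```

Because every node holding a copy can run the same check, a change to an earlier transaction is detectable without a central authority.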
So a blockchain can be used to store data in a distributed and geographically dispersed way, where the data is stored on blockchain nodes. The basic process goes like this:
The storage subsystem breaks the data up into blocks, or 'shards'
It then encrypts each shard, based on a key provided by the data owner
It generates a unique hash value for each shard, and records it in the shard metadata and the ledger
It creates redundant copies of the shard, with the number of copies and locations controlled by the data owner
A P2P network then distributes the replicated shards to several storage nodes, which can be distributed either regionally or globally for data resilience. It is expected that the nodes will be owned by various organisations or even individuals, who will lease out the storage space. However, the data will be spread between several storage owners, so only the content owners have full access to all their data.
Finally, the storage subsystem will record transactions in the blockchain ledger, then synchronise the ledger over all the participating nodes.
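The steps above can be sketched in a few lines, numbered here in the order they are listed. This is only an outline under several assumptions: the encryption function is supplied by the caller (the demo uses a toy stand-in that is not secure), the 'ledger' is a plain list rather than a real blockchain, the node names are invented, and nothing is actually sent over a network.

```python
import hashlib
from itertools import cycle
from typing import Callable, Dict, List


def make_shards(data: bytes, shard_size: int) -> List[bytes]:
    """Step 1: break the data into fixed-size shards."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def place_shards(data: bytes, shard_size: int, copies: int, nodes: List[str],
                 encrypt: Callable[[bytes], bytes]) -> List[Dict]:
    """Steps 2 to 6: encrypt, hash, replicate, distribute and record."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    ledger = []
    node_ring = cycle(nodes)
    for number, shard in enumerate(make_shards(data, shard_size)):
        sealed = encrypt(shard)                       # step 2
        digest = hashlib.sha256(sealed).hexdigest()   # step 3
        # Steps 4 and 5: consecutive picks from the ring are distinct nodes
        # because copies <= len(nodes).
        holders = [next(node_ring) for _ in range(copies)]
        # Step 6: record the placement (a plain list stands in for the ledger).
        ledger.append({"shard": number, "hash": digest, "nodes": holders})
    return ledger


def toy_encrypt(shard: bytes) -> bytes:
    """Stand-in only: XOR with a fixed byte is NOT real encryption."""
    return bytes(b ^ 0x5A for b in shard)


records = place_shards(b"example payload for sharding", shard_size=8, copies=2,
                       nodes=["node-a", "node-b", "node-c"],
                       encrypt=toy_encrypt)
for record in records:
    print(record["shard"], record["nodes"], record["hash"][:12])
```

A real system would use a proper cipher keyed by the data owner and would let the owner choose the number and location of the copies, as described above.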
We might see blockchain storage taking off in 2019, although there are some reservations about internet bandwidth. The advantage of using blockchain storage should be that because it is based on blockchain technology, it is verifiable, traceable, tamper-proof and controlled by the data owner.
What is the future for backup and recovery services? Applications can span several terabytes and while it is possible to back this up using traditional methods from snapshots, it would take several hours to recover an application from tape. That is pretty much unacceptable for most of today's businesses. For me, the future seems to be snapshots. The EMC DMX3 storage subsystem can snapshot a whole application with a single command, maintain up to 256 of those snapshots for each application, and restore the source volumes from any one of those snapshots. Of course, most of the time you don't want to restore the whole application, just a few files. So you mount the relevant snapshot on a different server and copy over the files that you want.
If you lose the whole subsystem you lose the source data and all the snapshots, so to fix that you need a second site with remote synchronous mirroring between the two. This is not the future of course; you can do all this now. I think the future is that backup and recovery applications will start to recognise that they do not need to move data about to create backups, but can use snapshots and mirrors as backup datastores. The role of the software would then be to manage all that hardware and maintain the necessary catalogs that refer to backups contained in all these snapshots, so the storage manager can easily work out what backups are available and also recover from them with simple commands.
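To make the catalog idea a little more concrete, here is a minimal sketch of a catalog that records which snapshots exist and answers the question 'which snapshot should I restore this file from?'. The class names, fields and sample entries are all invented for the illustration; a real product would populate the catalog from the storage subsystem itself.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class Snapshot:
    """Catalog entry for one snapshot; all fields are illustrative."""
    snapshot_id: str
    taken_at: datetime
    location: str                      # e.g. "local" or "remote-mirror"
    files: Set[str] = field(default_factory=set)


class BackupCatalog:
    """Tracks which snapshots exist and which files each one contains."""

    def __init__(self) -> None:
        self._snapshots: List[Snapshot] = []

    def register(self, snapshot: Snapshot) -> None:
        self._snapshots.append(snapshot)

    def latest_before(self, path: str, when: datetime) -> Optional[Snapshot]:
        """Newest snapshot holding `path` that was taken at or before `when`."""
        matches = [s for s in self._snapshots
                   if path in s.files and s.taken_at <= when]
        return max(matches, key=lambda s: s.taken_at, default=None)


catalog = BackupCatalog()
catalog.register(Snapshot("snap-001", datetime(2019, 1, 1, 2, 0), "local",
                          {"/app/db.dat", "/app/config.ini"}))
catalog.register(Snapshot("snap-002", datetime(2019, 1, 2, 2, 0), "local",
                          {"/app/db.dat"}))
hit = catalog.latest_before("/app/config.ini", datetime(2019, 1, 3))
print(hit.snapshot_id if hit else "no snapshot found")
```

The restore itself would still be a mount-and-copy from the snapshot that the catalog identifies.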
These snapshots and replicas can be used as read only data for development and testing, and even for updates once they are no longer required for backups. This means the extra capacity required for these copies is not an overhead, but can be used as an asset.
We may see standard backups using the cloud as a longer term repository, with on-site snapshots retained for large application restore or user error type restores from recent backups. As more important data is moved to the cloud, it too needs to be protected. Cloud-to-cloud backup, where data is copied from one cloud service to another cloud, will be important in 2019. Backup vendors will need to add cloud-to-cloud capabilities to satisfy this requirement. Specifically, they will need to add tools to back up and restore applications within the cloud.
Beware of vendors who tell you to just move your data to the cloud, and then backup and recovery is sorted as the data is replicated between dispersed data centres. Remember that if the data is replicated in real time (synchronous), then deletes and data corruption will be replicated too. Ask your cloud provider how they cope with this, and also how they cope with recovering versions of data from previous days.
Ransomware malware is picked up from an infected email attachment or website. It then encrypts your data and demands money for the decryption key. Ransomware attacks, such as WannaCry and Petya, have been big news in the last year or so. Victim organisations have two choices: pay the ransom or take a lot of downtime while fixing the problem. Many companies emphasise education, informing all employees of the risks and warning them not to open unsolicited attachments.
However, Backup and Recovery vendors are now adding ransomware protection to their products and this will continue in 2019. Your backup and recovery product can help in various ways; by detecting suspicious application behavior before files are corrupted, with ransomware monitoring and detection tools, or by using predictive analytics to determine the probability that ransomware is operating on a server. Companies that are doing this now include Acronis, Druva, Unitrends and Quorum and more will surely follow in 2019. Of course, don't overlook tape. Tapes are 'write once read many', so once a tape backup is created, it cannot be encrypted by an outside agency.
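One simple heuristic that a detection tool might use is to measure how random a file's contents look, since encrypted data has close to the maximum possible entropy. The sketch below is only an illustration of that idea and is not taken from any of the products named above; the 7.5 bits per byte threshold is an assumed value that would need tuning, and compressed files such as ZIP archives or JPEG images will also score highly.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0 for random."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Flag content whose entropy exceeds an assumed, tunable threshold."""
    return shannon_entropy(data) > threshold


text_sample = b"Quarterly sales report, region north. " * 200
random_sample = os.urandom(8192)   # stands in for freshly encrypted content

print(looks_encrypted(text_sample))
print(looks_encrypted(random_sample))
```

A monitoring tool could raise an alert when many files on a server change from low to high entropy in a short period, which is the kind of suspicious behaviour described above.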
Metadata Intelligence, the process of using metadata to manage data, is being touted as an exciting new way to get on top of managing your data. Of course, Mainframes have been using metadata like this for 30 years or more; the point is that Windows is starting to catch up. Metadata lets you see when a file was last opened and with this information, you can keep current data on fast flash storage and move older data off onto cheaper storage.
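As a small illustration of using the last-opened time to drive placement, the sketch below reads each file's last-access timestamp and suggests a tier. The tier names and the 30 day cut-off are assumptions for the example, and it only suggests a tier rather than moving any data. Note also that some file systems are mounted so that access times are updated rarely or not at all, in which case this metadata is less reliable.

```python
import os
import time
from typing import Optional


def suggest_tier(path: str, hot_days: int = 30,
                 now: Optional[float] = None) -> str:
    """Suggest 'flash' for recently accessed files, 'capacity' otherwise."""
    now = time.time() if now is None else now
    last_access = os.stat(path).st_atime     # last-access time from metadata
    age_days = (now - last_access) / 86400
    return "flash" if age_days <= hot_days else "capacity"


def plan_tiers(root: str, hot_days: int = 30) -> dict:
    """Walk a directory tree and map each file path to a suggested tier."""
    plan = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            plan[path] = suggest_tier(path, hot_days)
    return plan
```

Calling plan_tiers on a directory returns a dictionary of file paths and suggested tiers, which a migration job could then act on.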
The EU has recently introduced the General Data Protection Regulation (GDPR) legislation, which dictates how personal data must be stored, processed and deleted when the 'right to be forgotten' applies. Metadata Intelligence will help manage this, as data can be automatically stored and deleted based on pre-determined rules.
The requirement to store data securely will mean that data copies must be geographically dispersed, especially for long term archived data.
Artificial Intelligence, or AI, links into this. It is often used to detect ransomware viruses, but it can also be used to analyse your estate and make intelligent recommendations. An example of a product suite would be Igneous Systems, with DataDiscover to analyse and record your data, DataProtect to back it up, and Data Flow to move it round the system as required. You can also use Imanis Data Management Platform 4.0's SmartPolicies to generate backup schedules which are based on a desired recovery point objective as set by the user.
When storing data, we normally use two storage tiers, maybe flash storage and HDD, with tape as a possible third tier. The problem is that the investigation and coding needed to manually manage three tiers is not trivial. However with AI doing all the work, it becomes possible to add and manage even more tiers, to get the optimum balance between performance and cost for different classes of data.
In more general terms, AI will be used for a wide range of processes in the future and all these processes will need lots of data, which must be stored securely and be accessible with the best of performance, especially if AI is working in real time.
The various disks, disk arrays, switches and other bits of the storage estate generate lots of data describing the current health of the product. Predictive Storage Analytics is about continuously analysing all those data points, to predict the future behaviour of the storage estate. The theory is that this can include pinpointing potential developing problems, such as defective cables, drives and network cards, then alerting support staff, with a precisely located problem and a recommended solution. One of my least favourite error messages goes something like 'An unidentified System Error has occurred'. I'm not sure how that would be pinpointed.
Predictive Storage Analytics would also be able to monitor storage pools, cache, CPU and channel utilisation and recommend capacity requirements now and in the future.
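A very basic form of capacity recommendation is to fit a straight line to past usage and see when it reaches the installed capacity. The sketch below does this with an ordinary least-squares fit; the sample figures are invented, and real analytics products would use far richer models than a linear trend.

```python
from typing import List, Optional, Tuple


def fit_line(ys: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_full(used_tb: List[float],
                       capacity_tb: float) -> Optional[float]:
    """Periods from the latest sample until the trend line reaches capacity."""
    slope, intercept = fit_line(used_tb)
    if slope <= 0:
        return None                      # usage is flat or shrinking
    x_full = (capacity_tb - intercept) / slope
    return max(0.0, x_full - (len(used_tb) - 1))


monthly_used_tb = [40.0, 42.0, 44.0, 46.0, 48.0]   # invented sample data
print(periods_until_full(monthly_used_tb, capacity_tb=60.0))   # 6.0
```

With the sample data growing by 2TB a month, the pool is projected to fill six months after the latest reading, which is the kind of figure that could drive a purchase recommendation.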
HCI is designed to reduce data center complexity and increase scalability. A workable definition of Hyper Converged Infrastructure could be "HCI is a single system framework that combines storage, computing and networking". HCI platforms typically run on standard, off-the-shelf servers and include software-defined storage, a hypervisor for virtualized computing and virtualized networking. Several hypervisor nodes can be clustered together to create pools of shared compute and storage resources and they can include pre-configured monitoring, backups, networking and storage configuration. Extra resources can be dynamically allocated as needed, without requiring system downtime.
HCI would typically be introduced as part of a data center modernization project, and would provide a company with the scalability and cost benefit of a public cloud infrastructure without having to give up the control element of having hardware on their own premises. HCI can be hardware based with an integrated HCI appliance from a single vendor, or it can be software based and so hardware-agnostic.
A hardware approach will use commodity components and will be supported by a single vendor. The advantage of this approach is that you get an infrastructure that should be easier to manage and more flexible. The disadvantage is that you are locked into a single supplier. These HCI systems were initially targeted at general-purpose workloads with fairly predictable resource requirements such as virtual desktop infrastructure. They are now used for more unpredictable applications, such as Oracle and SQL servers, file and print services, and web servers.
A software based solution lets you deploy HCI on your own technology. There are obvious initial cost savings with this approach, and also potential ongoing savings, as you can negotiate upgrade deals with several suppliers. The downside, of course, is some loss in simplicity. HCI software vendors include Maxta and VMware (vSAN).
Composable infrastructure is an alternative to HCI. Unlike HCI, which uses a hypervisor to manage the virtual resources, composable infrastructure uses APIs and management software both to recognize and aggregate all physical resources into the virtual pools and to provision, or compose, the end IT products. It didn't quite live up to the hype in 2018, but it may take off in 2019 as more vendors are putting out composable infrastructure products.
No-one is sure exactly how this one will play out yet, but the basic quantum data storage block is a 'qubit', which is a quantum particle that can simultaneously be both a one and a zero, and which can be entangled with other qubits. A computer that uses qubits can process some types of problem much faster than a traditional computer. The problem with qubits is that they are very transient, with a storage time of just over a second at best.
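An intriguing property of entanglement is that the link between entangled particles holds at a distance, and appears to be instantaneous. It is tempting to conclude that the speed of light would no longer be a limitation and that remote mirroring could be instant, but entanglement on its own cannot carry usable information faster than light, so that remains speculative. Another advantage is that it is almost hack-proof as it uses a quantum key for encryption.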
Quantum computers exist now (IBM has one online), but they are some way from being a commercial proposition yet.