Getting a Handle on Big Data with Hadoop

The flood of information from social media and elsewhere is propelling companies' use of free and customizable software called Hadoop to manage it

By Rachael King

Wal-Mart Stores, struggling to translate its brick-and-mortar success to the Web, is using free software named after a stuffed elephant to help it gain an edge on Amazon.com in the $165.4 billion U.S. e-commerce market.

As customers flock to social media, Wal-Mart (WMT) expects sites such as Facebook and Twitter to play a bigger role in online shopping. By analyzing what social network users say about products on those sites, the world’s largest retailer aims to glean insights into what consumers want.

With its online sales less than a fifth of Amazon’s last year, Wal-Mart executives have turned to software called Hadoop that helps businesses quickly and cheaply sift through terabytes or even petabytes of Twitter posts, Facebook updates, and other so-called unstructured data. Hadoop, which is customizable and available free online, was created to analyze raw information better than traditional databases like those from Oracle (ORCL).

“When the amount of data in the world increases at an exponential rate, analyzing that data and producing intelligence from it becomes very important,” says Anand Rajaraman, senior vice-president of global e-commerce at Wal-Mart and head of @WalmartLabs, the retailer’s division charged with improving its use of the Web.

Walt Disney (DIS), General Electric (GE), Nokia (NOK), and Bank of America (BAC) are also using Hadoop. The software can be applied to a variety of tasks including marketing, advertising, and sentiment and risk analysis. IBM (IBM) used the software as the engine for its Watson computer, which competed against the champions of the TV game show Jeopardy!

Wal-Mart’s Big Bet

For all its girth in retail stores, Wal-Mart’s online operations—started more than a decade ago—are still dwarfed by Amazon.com (AMZN). According to analysts at Wells Fargo Securities, Wal-Mart has about $6 billion in online sales, compared with Amazon.com’s $34.2 billion in 2010 revenue.

The retailer is making a big bet on Hadoop, open-source software that got its start among a group of Yahoo! (YHOO) developers. One challenge of Hadoop is getting it all to work together in a corporation: the software is made up of a half-dozen separate pieces that must be integrated, says Merv Adrian, a research vice-president at Gartner (IT). That requires expertise, which is in short supply, he says.

Still, Hadoop is riding the “big data” wave, where the massive quantity of unstructured information “presents a growth opportunity that will be significantly larger” than the $25 billion relational database industry dominated by Oracle, IBM, and Microsoft (MSFT), according to a July report by Cowen & Co.

This year, 1.8 zettabytes (1.8 trillion gigabytes) of data will be created and replicated, according to IDC’s June Digital Universe report, sponsored by EMC (EMC), the world’s biggest maker of storage computers. One zettabyte is the equivalent of the information on 250 billion DVDs, according to Cisco Systems’ (CSCO) Visual Networking Index.

Data Spending Growth

The increasing popularity of Hadoop software also mirrors the growth in corporate spending on handling data. Since 2005, the annual investment by corporations to create, manage, store, and generate revenue from digital information has increased 50 percent to $4 trillion, according to the IDC report.

About 80 percent of corporate data is unstructured, a category that includes office productivity documents, e-mail, and Web content as well as social media. By contrast, Oracle sells companies its Exadata system to manage huge quantities of structured information such as financial data.

“Hadoop plays in a much larger market than Exadata and is a materially cheaper way to process vast data sets,” says Peter Goldmacher, an analyst at Cowen & Co. in San Francisco.

Based on Oracle’s earnings conference call in June, analysts including James Kobielus at Forrester Research expect Oracle to make a Hadoop-related announcement in October. Oracle declined to comment.

Web companies were the first to face the big-data challenge now confronting large corporations. In 2004, Google (GOOG) published a paper about software called MapReduce that used distributed computing to handle large data sets.
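The core of the MapReduce idea can be shown with a minimal word-count sketch in Python. This is an illustration of the programming model only, not Google’s or Hadoop’s implementation; in a real cluster, the map and reduce phases run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map step: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce step: group pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Hadoop handles big data", "big data needs Hadoop"]
print(reduce_phase(map_phase(docs)))
# {'hadoop': 2, 'handles': 1, 'big': 2, 'data': 2, 'needs': 1}
```

Because each map call touches only one document and each reduce call only one word’s pairs, the work can be split across thousands of commodity servers, which is what lets Hadoop scale to petabytes.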

Hadoop’s Creator

Inspired by Google’s paper and some software that his employer Yahoo had developed, Doug Cutting created Hadoop, named after his son’s stuffed elephant, in 2006. Cutting now works at a company called Cloudera that offers Hadoop-related software and services for corporations. Its customers include Samsung Electronics, AOL Advertising, and Nokia.

“It was obvious to me that the problems that Google and Yahoo and Facebook had were the problems that the other companies were going to have later,” says Cloudera Chief Executive Officer Mike Olson.

While Yahoo developers have contributed most of the code to Hadoop, it’s an open project, part of the Apache Software Foundation. Developers around the world can download and contribute to the software. Other Hadoop-related projects at Apache have names such as Hive, Pig, and Zookeeper.

Some of the original Yahoo contributors to Hadoop have formed a spinoff called Hortonworks to focus on development of the software. The company expects that within five years more than half of the world’s data will be stored in Hadoop environments.

Social Commerce

Wal-Mart, recognizing that the next generation of commerce would be social, purchased startup Kosmix for $300 million in April to create @WalmartLabs.

The acquisition gave it immediate expertise in big data: Kosmix co-founders Rajaraman and Venky Harinarayan had earlier started Junglee, the company that pioneered Internet comparison shopping in 1996 and was later purchased by Amazon. At Kosmix, they also built something called the Social Genome, which applies semantic-analysis technology to a real-time flood of social media to understand what people are saying.

For now, @WalmartLabs uses Hadoop to create language models so that the site can return product results if the shopper enters a related word. For example, if somebody searches for a “backyard chair” on Walmart.com, the site will return results for patio furniture. In the future, Wal-Mart may be able to return styles of patio furniture most likely to appeal to a particular shopper based on his tweets and Facebook updates.
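The “backyard chair” example amounts to query expansion: mapping a shopper’s words to related product terms before searching. A toy sketch follows, with a hand-written lookup table standing in for the language models @WalmartLabs mines from data (the table and function names here are hypothetical, for illustration only):

```python
# Hypothetical related-term table; the real models are learned
# from large volumes of search and social data, not hand-coded.
RELATED_TERMS = {
    "backyard chair": ["patio furniture", "outdoor seating"],
    "couch": ["sofa", "loveseat"],
}

def expand_query(query):
    # Return the original query plus any known related terms,
    # so the product search can match a broader result set.
    return [query] + RELATED_TERMS.get(query.lower(), [])

print(expand_query("backyard chair"))
# ['backyard chair', 'patio furniture', 'outdoor seating']
```

At Wal-Mart’s scale, the interesting part is building the table itself, which is where Hadoop’s ability to chew through raw text comes in.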

The company also uses Hadoop in its keyword campaigns to drive traffic from search engines to Walmart.com. The software collects information about which keywords work best to turn Internet surfers into shoppers, and then comes up with the optimal bids for different words.

“Our keyword campaigns include millions and millions of keywords on Google and Bing and we vary the bids we set on almost a real-time basis,” says Rajaraman.
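One simple way to think about the bidding problem: cap each keyword’s bid at the profit a click on it is expected to produce. The sketch below assumes a deliberately crude model (conversion rate times order value times margin); Wal-Mart’s actual system is not described in that detail here, and the function and parameters are illustrative.

```python
def optimal_bid(clicks, conversions, avg_order_value, margin=0.25):
    """Illustrative bid cap: expected profit per click.

    Assumes bid should not exceed the margin a click is expected
    to generate. Real campaign systems weigh many more signals
    and update bids continuously.
    """
    if clicks == 0:
        return 0.0
    conversion_rate = conversions / clicks
    return round(conversion_rate * avg_order_value * margin, 2)

# A keyword that converts 2% of clicks on a $60 average order:
print(optimal_bid(clicks=5000, conversions=100, avg_order_value=60.0))
# 0.3
```

Run across millions of keywords with fresh click data, even a simple rule like this shows why the computation has to happen “on almost a real-time basis” over large logs, the kind of workload Hadoop was built for.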

Nokia’s Data Trove

Nokia is another company that recently recognized the treasure trove of data it’s sitting on.

“Sixty percent of the world’s population has mobile devices, and Nokia has 25 percent of those mobile customers,” says Amy O’Connor, senior director of analytics at Nokia. “Over the course of the past year we realized we had all this data we could use competitively.”

For example, Nokia collects information for its Navteq mapping service, which it sells to large businesses. The company can tap into probes and mobile devices around the world to gather traffic data. To figure out information about a particular street, the company used to have people weed through hundreds of terabytes of data.

“It was a manual process before Hadoop,” O’Connor says.

Now that the company is taking advantage of this unstructured information, the amount of data that it manages is skyrocketing. Over the next year or so, O’Connor anticipates that Nokia’s network will be handling as much as 20 petabytes (20 million gigabytes) of information, up from several hundred terabytes managed over the past year.

“The tsunami of data is not going to stop,” O’Connor says.