With the constant increase in the quantity of data that companies collect and need to process, Data Warehousing is a job sector that’s expanding even in the recession. It is also living a second youth, thanks to a number of open source projects that have been slowly but surely gaining popularity, much as Linux did 10 years ago. One of these technologies is Hadoop, a distributed filesystem and data processing framework inspired by Google’s file system and Map/Reduce papers. Hadoop powers the data warehouses behind Yahoo! Search, Facebook and many other sites. If you’re thinking about learning more about Data Warehousing, I have 2 books to recommend. The first one covers the basic concepts and terminology of data warehousing; the second covers the new kid on the block, Hadoop.
The Data Warehouse ETL Toolkit
This book covers the general concepts and the terminology you need to know. There’s no code and nothing specific to any system. It assumes you’re using some kind of relational database, and some kind of tool to do your ETL (Extract, Transform, Load). It walks you through all the possible processes your data may need to go through, as well as the problems you may run into along the way.
Hadoop: The Definitive Guide
This is an excellent book that provides an in-depth theoretical explanation of Hadoop and its concepts. It has a lot of material, and it’s going to be useful to the novice as well as to the expert. It covers the internals of Hadoop (input formats, compression, splits) as well as the more mundane and practical aspects like installation, administration and monitoring. It doesn’t say much about the other tools that are built on Hadoop (Hive, HBase), but it does give you an idea of how they relate to each other.
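If the Map/Reduce model behind Hadoop is new to you, here’s a tiny pure-Python sketch of the idea, using the classic word count. To be clear, this is not Hadoop’s API and it doesn’t come from either book: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process, whereas Hadoop spreads the same three steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(sample))
```

The point of splitting the job this way is that each call to `map_phase` and `reduce_phase` only looks at its own input, so in principle the calls can run on different machines. That’s the property a framework like Hadoop takes advantage of.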
Posted on October 27, 2009 by Roberto Congiu