Big data & small files problem in HDFS

Eyal Yoli Abs · DataDrivenInvestor · Jan 20, 2020


Everyone these days is talking about “Big data”, but does everyone know what makes big data big? Just try to read the definition of “Big data” on Wikipedia: “a field that…deal[s] with data sets that are too large or complex to be dealt with…”. Nowhere will you find the numbers that make data too large or too complex.

Defining “big” in data

As the team leader of a one-year-old big data team, I can admit that we didn’t comprehend the definition of BIG, and neither did our organization!

The purpose of the project was to process log data from different sources to raise the security bar of our networks. Log data is mostly composed of events, averaging around 500 bytes each; some are as small as 200 bytes.

Beforehand, we had stability problems in Hadoop HDFS because we overlooked a basic assumption: HDFS is designed for large files (1 GB and more). When instability struck, we started having second thoughts about what makes us big data! After all, we come from a big-data-less world (and organization), so it was, and always will be, hard to build a firm knowledge base to cling to.

We configured Flume (our ingestion tool) to ingest data in 5-minute batches and write them to HDFS. With 500 bytes per event and input rates ranging from 15 events/sec to 10,000 events/sec, we ended up with numerous files of between 8 KB and 140 KB. Taking all inputs into account, the total input rate was ~55 GB/hour, which felt like enough to be considered big data.
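To make the small-files pain concrete, here is a rough back-of-the-envelope sketch in Python (not a measurement from our cluster). It assumes the commonly cited figure of roughly 150 bytes of NameNode heap per namespace object (file or block) and the default 128 MB block size, and it simplifies by treating the hourly volume as equally sized files:

```python
# Back-of-the-envelope estimate of NameNode heap cost for small vs. large files.
# The 150 bytes/object and 128 MB block size are assumptions, not cluster measurements.

BYTES_PER_OBJECT = 150          # rough NameNode heap cost per file/block object
BLOCK_SIZE = 128 * 1024**2      # default HDFS block size (128 MB)
HOURLY_VOLUME = 55 * 1024**3    # ~55 GB/hour of ingested data (from this post)

def namenode_heap(file_size_bytes, total_bytes=HOURLY_VOLUME):
    """Estimate NameNode heap used to track total_bytes stored as equally sized files."""
    files = total_bytes // file_size_bytes
    blocks_per_file = max(1, -(-file_size_bytes // BLOCK_SIZE))  # ceiling division
    objects = files * (1 + blocks_per_file)                      # 1 inode + its blocks
    return files, objects * BYTES_PER_OBJECT

for label, size in [("140 KB files", 140 * 1024), ("1 GB files", 1024**3)]:
    files, heap = namenode_heap(size)
    print(f"{label}: {files:,} files/hour -> ~{heap / 1024**2:.1f} MB of NameNode heap/hour")
```

Even with these crude assumptions, storing the same hourly volume as ~140 KB files costs orders of magnitude more NameNode memory than storing it as 1 GB files, and that cost keeps accumulating for as long as the data is retained.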

On our puzzled search for absolute answers, I came across this blog. Although it is rough in language, it is concise and will save you valuable time and money when you consider going BIG (data).

The page specifically discusses adopting Hadoop’s ecosystem as your big data platform. Even though it provides some numbers that give a clearer picture, those numbers are inaccurate from several viewpoints, and this should be taken into account.

First of all, the page was written in 2013, when Hadoop was at version 2.2.0. Just to get things straight, today we are at version 3.2, so things have changed.

Second, if we plug the numbers from the page into the Wikipedia definition, we can say the author considered 5 TB+ to be a data set that is too large/complex to process. 5 TB certainly is large, but “large” and “complex” are relative: relative to the other data in the digital world and to the processing power we have. The following forecasts make it clear that both change rapidly:

1. In only a year, the world’s accumulated data will grow to 44 zettabytes (that’s 44 trillion gigabytes)! For comparison, today it’s about 4.4 zettabytes (retrieved from hostingtribunal.com).

2. Moore’s law says computing power will double roughly every two years (expected to hold until the 3 nm node) (retrieved from Wikipedia).

In conclusion, the 5 TB of 2013 is probably more than 20 TB today (using the latest tech and disregarding software limitations); the Moore’s-law doublings since 2013 alone would multiply it by four or more.

Additionally, in most cases you’ll probably stream new data (or load deltas every X amount of time), so your data will have a growth factor. This is relevant to consider if the growth is reflected in the datasets that you or your clients query.

Growth in a dataset can come from several sources: extending the time range of your query, an increase in the rate of messages sent by an existing input, and finally, adding new data inputs that you wish to process.

It is essential to understand the definition of your dataset.

When we understood our clients’ datasets better, we found that the various research questions needed data spanning days, months, or even years. These datasets vary from 100 MB to 100 TB of processed data. So this is big!

Unfortunately, for Hadoop, files smaller than 1 GB are abusive; thus, we still have a problem with the way we store our data.

Even though the problem still existed, answering the BIG question was valuable and worth revisiting, to verify that our requirements hadn’t changed beneath our feet.

Solving the small files problem in HDFS

At this point in the project, with active clients already using data stored on Hadoop, we simply couldn’t move to Cassandra (which copes better than Hadoop with small files; in hindsight, choosing Hadoop over Cassandra wasn’t the most suitable decision for our initial requirements). Hence we needed to find a simpler solution based on the same architecture, one that is backward compatible with our current directory structure.

That’s when we got the idea to aggregate the data hourly to get the most out of our current indexing convention (check out the previous story), giving us the biggest single file we possibly could have for every hour.

It is a simple program that reads the data written by Flume for a specific (past) hour and writes it to a single Parquet file in a parallel directory structure. This optimization saved us a lot of space and drastically reduced the file count in HDFS, which in turn improved HDFS performance and stability by lowering the RAM usage of the NameNode. You can check out these benchmarks between the different formats; it is basic knowledge that we, data engineers, must know!
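For illustration, a compaction job along these lines can be sketched in a few lines of PySpark. This is not our actual program: the path layout, the JSON raw format, and the choice of Spark are assumptions made only to show the shape of the idea.

```python
# Minimal sketch of hourly compaction: read one past hour of small raw files,
# rewrite them as a single Parquet file in a parallel directory tree.
# Paths, format, and the use of Spark are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-compaction").getOrCreate()

def compact_hour(raw_root, published_root, year, month, day, hour):
    """Read all small raw files for one past hour and rewrite them as one Parquet file."""
    src = f"{raw_root}/{year:04d}/{month:02d}/{day:02d}/{hour:02d}"
    dst = f"{published_root}/{year:04d}/{month:02d}/{day:02d}/{hour:02d}"
    events = spark.read.json(src)                            # raw events written by Flume (JSON assumed)
    events.coalesce(1).write.mode("overwrite").parquet(dst)  # one big file per hour

# Example: compact the 13:00-14:00 slot of Jan 19, 2020.
compact_hour("/data/raw", "/data/published", 2020, 1, 19, 13)
```

The `coalesce(1)` is the whole point of the exercise: fewer, bigger files for the NameNode to track.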

This uncomplicated strategy worked so well that we expanded its functionality to support the outputs of diverse stream applications and to be more flexible with the input/output directory structure.

We estimated this solution would hold for merely a year or two, but its lifespan turned out to be longer than we foresaw. Moreover, by decoupling the raw data format from the published data format, it gave us flexibility. We started with just plain text and JSON; afterward, we moved to Parquet as the final published format for our clients, while keeping the raw format flexible (we use whichever format best fits the input data type), and the app did the conversion between the two. Later, when we added new formats, it was straightforward to implement the conversion in the aggregation app.
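As a hypothetical illustration of that decoupling, the aggregation step can treat the raw-format reader as pluggable, so adding a new input format only touches one lookup table. The registry, reader functions, and paths below are invented for this sketch, not taken from our app.

```python
# Hypothetical sketch of raw/published decoupling: each raw format registers a
# reader, and the aggregation step always publishes Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

RAW_READERS = {
    "json": lambda path: spark.read.json(path),
    "text": lambda path: spark.read.text(path),
    "csv":  lambda path: spark.read.option("header", "true").csv(path),
}

def publish(raw_format, src, dst):
    """Convert one hour of raw data (any registered format) into a single published Parquet file."""
    df = RAW_READERS[raw_format](src)
    df.coalesce(1).write.mode("overwrite").parquet(dst)

# Adding a new raw format later only means adding one entry to RAW_READERS.
publish("json", "/data/raw/2020/01/19/13", "/data/published/2020/01/19/13")
```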

Rethinking your project’s assumptions and foundations is beneficial: it strengthens your belief in the actions you take and the decisions you make. Just remember, when making these decisions, stay on the simplest, easiest side to retain future flexibility.


A software engineer, leading big data teams in the cyber dimension.