Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)
Conclusion: Hopefully this has illustrated some points about using and abusing tools like Hadoop for data processing tasks that can better be accomplished on a single machine with simple shell commands and tools.
This article is good for new programmers: it shows why certain solutions are better at certain scales; there is no silver bullet. Also, this is from 2014 and the dataset is < 4GB, so there's no reason to use Hadoop.
The discussion we had here involved TBs of data, so I'm curious how CLI tools would be faster than parallel processing at that scale...
jq is very convenient, even if your files are more than 100GB.
I often need to extract one field from huge JSON Lines files; I just pipe them through jq to get results. It's slower, but implementing proper data processing would take more time.
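A minimal sketch of that workflow: since jq processes line-delimited JSON one record at a time, it streams arbitrarily large files in constant memory. The file name `events.jsonl` and the field `.user_id` below are made-up examples, not from the comment.

```shell
# Extract one field per line from a large JSON Lines file.
# jq streams record by record, so memory use stays flat even on 100GB+ inputs.
jq -r '.user_id' events.jsonl > user_ids.txt
```

If one core isn't enough, the same command parallelizes with GNU parallel's `--pipepart`, which splits the input file into chunks and feeds each to a separate jq process.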
> Hadoop: bro
> Spark: bro
> hive: bro
> data team: bro