| SQL | 65% |
| Python | 60% |
| Data Pipelines | 55% |
| Data Warehouse | 50% |
| Hadoop | 45% |
| Hive | 45% |
| ETL | 40% |
| Spark | 40% |
| AWS | 30% |
| Redshift | 30% |
| Java | 25% |
| Kafka | 25% |
| MapReduce | 25% |
| Ruby | 25% |
| Scala | 25% |
| Vertica | 25% |
| Data Quality | 20% |
| JavaScript | 20% |
| NoSQL | 20% |
| Statistics | 20% |
Roles in a Data Organization
So what are the roles in a data organization? Data Engineers are the worker bees; they are the ones actually implementing the plan and working with the technology. Managers, both development and project, support them: development managers may or may not do some of the technical work themselves, but they manage the engineers, while project managers handle the logistical details and timelines to keep the project moving according to plan. Data Architects are the visionaries. They lead the innovation and technical strategy of the product and architecture. Highly experienced and deeply technical, they grow out of an engineer position; they are valuable and rare ducks, since they've essentially been working in this field since its beginning.

When we surveyed several "Data Architect" job descriptions on Glassdoor, LinkedIn and Indeed.com, we found many similarities to the skills required of Data Engineers, so let's focus on the differences: things like coaching and leadership, data modelling, and feasibility studies. Architects are also expected to have a firm grasp of legacy technologies; the ones typically mentioned include Oracle databases, Teradata, SQL Server and Vertica. We don't cover these much in the resources below because there's already extensive documentation on them.

The Data Pipeline, Described
To understand what the data engineer (or architect) needs to know, it's necessary to understand how the data pipeline works. What follows is obviously a simplified version, but it should give you a basic understanding of the pipeline.
Common programming languages are the core programming skills needed to grasp data engineering and pipelines generally. Among other things, Java and Scala are used to write MapReduce jobs on Hadoop; Python is a popular pick for data analysis and pipelines; and Ruby is also a popular application glue across the board.
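To make the MapReduce point concrete, here is a minimal word-count sketch written in Python for Hadoop Streaming, which is the same logic you would otherwise write in Java or Scala. The file layout and the streaming invocation in the comments are illustrative assumptions, not details from this article.

```python
#!/usr/bin/env python
# Minimal word-count pair for Hadoop Streaming (file names are illustrative).
# In practice mapper and reducer live in two scripts passed to:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /logs/raw -output /logs/wordcount
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Hadoop sorts mapper output by key, so counts for the same word
    # arrive on consecutive lines; sum them up.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

if __name__ == "__main__":
    # Here a flag selects which role to play so the sketch fits in one file.
    reducer() if "--reduce" in sys.argv else mapper()
```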
Collection and ingestion are tools at the beginning of the pipeline. Common open-source examples are Apache Kafka, Fluentd and Embulk. This stage is where data is taken from sources (among them applications, web and server logs, and bulk uploads) and uploaded to a data store for further processing and analytics. This upload can be streaming, batch or bulk. These tools are far from the only ones; many dedicated analytics tools (including Treasure Data) have SDKs for a range of programming languages and development environments that do this.
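As an illustration of the ingestion step, here is a minimal sketch that pushes application events into Apache Kafka using the kafka-python client. The broker address, topic name and event fields are assumptions made for the example.

```python
# Hedged sketch of event ingestion with kafka-python.
# Assumes a broker at localhost:9092 and a topic named "app-events".
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each call appends one event to the topic; downstream consumers (or a
# connector) move these records into a data store for later analysis.
producer.send("app-events", {"user_id": 42, "action": "signup"})
producer.send("app-events", {"user_id": 42, "action": "purchase", "amount": 9.99})

producer.flush()  # make sure buffered records actually reach the broker
```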
Storage and management are typically in the middle of the pipeline and take the form of Data Warehouses, Hadoop, Databases (both RDBMS and NoSQL), Data Marts and technologies like Amazon Redshift and Google BigQuery. Basically, this is where data goes to live so it can be accessed later.
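For the storage layer, a common pattern is to bulk-load files from object storage into a warehouse. The sketch below loads JSON events from S3 into Amazon Redshift using psycopg2 (Redshift speaks the PostgreSQL protocol); the connection details, table, bucket and IAM role are all placeholder assumptions.

```python
# Hedged sketch: bulk-load S3 data into a Redshift table with a COPY command.
# Connection parameters, table, bucket and IAM role below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)

copy_sql = """
    COPY events
    FROM 's3://example-bucket/events/2024-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the files directly from S3

conn.close()
```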
Data processing is typically at the end of the pipeline. SQL, Hive, Spark, MapReduce, ELK Stack and Machine Learning all go into this bucket and are used to make sense of the data. Are you querying your data into a format to use for visualization (like Tableau, Kibana or Chartio)? Are you formatting your data to export to another data store? Or maybe running a machine learning algorithm to detect anomalous data? Data processing tools are what you'll use.
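To illustrate the processing stage, here is a minimal PySpark sketch that rolls the ingested events up into a summary you could hand to a visualization tool. The input path and column names are assumptions made for the example.

```python
# Hedged sketch of a processing job in PySpark.
# The input path and the "action"/"amount" columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-summary").getOrCreate()

# Read the raw events that the ingestion layer landed in the data store.
events = spark.read.json("/data/events/")

# Roll events up into a small table suitable for Tableau, Kibana or Chartio.
summary = (
    events.groupBy("action")
          .agg(F.count("*").alias("event_count"),
               F.sum("amount").alias("total_amount"))
)

summary.write.mode("overwrite").parquet("/data/summaries/actions/")

spark.stop()
```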
The point is, when you see a job ad or a recruiter referring to a specific technology, make it a goal to understand what the technology is, what it does, and what part of the data pipeline it fits into.