
What is Google Dataproc?

Google Dataproc is a managed Apache Spark and Apache Hadoop service from Google Cloud. It is designed to be fast, easy to use, and low cost.

Google Dataproc includes Apache Hadoop, Apache Spark, Apache Pig, and Apache Hive. It is well integrated with the rest of the Google Cloud Platform through connectors and APIs, which makes it easy to build a complete data processing and analytics platform.

It has built-in integration with BigQuery, Cloud Storage, Bigtable, Stackdriver Logging, and Stackdriver Monitoring.

For example, you can extract, transform, and load terabytes of data directly into BigQuery for reporting, or run a MapReduce job and store its results in Bigtable.

You can access and perform operations in Dataproc in three ways: through the web console, the gcloud command-line tool, or the Dataproc REST API.

Creating a Cluster

The easiest way to create a Dataproc cluster is through the web console. The steps are as follows.

  1. In the Cloud Platform Console, go to the Cloud Dataproc Clusters page.
  2. Click Create cluster.
  3. Enter your cluster name in the Name field.
  4. Select a zone for the cluster from the Zone list.
  5. Leave the rest of the options at their default values.
  6. Click Create to create the cluster.
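
The console steps above can also be scripted. The sketch below is a minimal illustration, not the only way to do it: it assembles the arguments for the `gcloud dataproc clusters create` command from a name, region, and zone, leaving every other option at its default just as step 5 does. The helper name `build_create_command` and the values `example-cluster`, `us-central1`, and `us-central1-a` are assumptions made for this example.

```python
import shlex


def build_create_command(cluster_name, region, zone):
    """Return the gcloud argument list that creates a cluster with default settings."""
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",  # region that hosts the Dataproc cluster
        f"--zone={zone}",      # Compute Engine zone, as in step 4
    ]


# Hypothetical values, used for illustration only.
command = build_create_command("example-cluster", "us-central1", "us-central1-a")

# Print the command as a single shell-safe line for review before running it.
print(shlex.join(command))
```

On a machine where the Cloud SDK is installed and authenticated, the list can be passed to `subprocess.run(command, check=True)` to create the cluster. Building the arguments as a list rather than a single string avoids shell quoting problems.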

Your new cluster will appear in the cluster list with the status “provisioning”. Once provisioning completes, the status changes to running.
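
If you create clusters from a script, you may want to wait for this transition before submitting work. The following sketch polls the cluster state until it reports `RUNNING`. It assumes that the console's “provisioning” label corresponds to the `CREATING` state reported by `gcloud dataproc clusters describe`. The function names `fetch_state` and `wait_until_running`, and the default retry settings, are hypothetical choices for this example.

```python
import subprocess
import time


def fetch_state(cluster_name, region):
    """Ask gcloud for the cluster's current state, for example CREATING or RUNNING."""
    result = subprocess.run(
        ["gcloud", "dataproc", "clusters", "describe", cluster_name,
         f"--region={region}", "--format=value(status.state)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_until_running(get_state, attempts=30, delay_seconds=10.0):
    """Poll get_state() until it returns RUNNING.

    Raises RuntimeError if the cluster reports ERROR, and TimeoutError
    if RUNNING is not reached within the given number of attempts.
    """
    for _ in range(attempts):
        state = get_state()
        if state == "RUNNING":
            return state
        if state == "ERROR":
            raise RuntimeError("cluster entered the ERROR state")
        time.sleep(delay_seconds)
    raise TimeoutError("cluster did not reach RUNNING in time")
```

A typical call would be `wait_until_running(lambda: fetch_state("example-cluster", "us-central1"))`, where the cluster name and region are again placeholders. Passing the state lookup in as a function keeps the polling loop independent of gcloud, so it can be exercised without a real cluster.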

Now you can start using your Dataproc cluster to integrate with other services to implement your data processing platform.

Source: cloud.google.com