Top AWS Glue (2024) Interview Questions | CodeUsingJava

Top AWS Glue Interview Questions


  1. What is AWS Glue?
  2. What are the Features of AWS Glue?
  3. How does AWS Glue manage the ETL service?
  4. What are the use cases of AWS Glue?
  5. What are the drawbacks of AWS Glue?
  6. How can we Automate Data Onboarding?
  7. How to list all databases and tables in the AWS Glue Data Catalog?
  8. What is the AWS Glue Data Catalog?
  9. How can the AWS Glue Data Catalog be accessed with Amazon Athena?
  10. What are AWS Glue Crawlers?
  11. What is the AWS Glue Schema Registry?
  12. How can the Schema Registry be integrated?
  13. How can we solve the HIVE_PARTITION_SCHEMA_MISMATCH error?
  14. How to define a nested array to ingest and convert data?
  15. How to execute AWS Glue scripts using Python 2.7 from a local machine?
  16. What is AWS Glue Streaming ETL?
  17. How to set a name for a crawled table?
  18. How to specify join types in AWS Glue?

What is AWS Glue?

AWS Glue is a service that makes it simple and cost-effective to categorize our data, clean it, and move it reliably between various data stores and data streams. It consists of a central metadata repository called the AWS Glue Data Catalog. AWS Glue generates Python or Scala code and handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there is no infrastructure to set up or manage. It provides a component known as the DynamicFrame, which we use in our ETL scripts. A DynamicFrame is similar to an Apache Spark DataFrame, a data abstraction used to organize data into rows and columns.
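Since a DynamicFrame interoperates with Spark DataFrames, a common pattern is converting between the two. A minimal sketch, assuming a Glue job environment (the database and table names here are hypothetical placeholders):

```python
def dynamic_frame_roundtrip(glue_context, database="example_db", table_name="example_table"):
    """Convert a Glue DynamicFrame to a Spark DataFrame and back.

    `database` and `table_name` are illustrative placeholders.
    """
    # Local import: awsglue is only available inside a Glue job environment
    from awsglue.dynamicframe import DynamicFrame

    # Read a catalog table as a DynamicFrame (records are self-describing)
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name)

    # DynamicFrame -> Spark DataFrame for standard Spark SQL operations
    df = dyf.toDF()

    # Spark DataFrame -> DynamicFrame to use Glue transforms and sinks again
    return DynamicFrame.fromDF(df, glue_context, "roundtrip")
```

This relies on `toDF()` and `DynamicFrame.fromDF()`, the documented conversion pair in the Glue Python API.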

What are the Features of AWS Glue?

  • Automatic Schema Discovery - crawlers automatically obtain schema-related information and store it in the Data Catalog.
  • Job Scheduler - several jobs can be started in parallel, and users can specify dependencies between jobs.
  • Developer Endpoints - help in creating custom readers, writers, and transformations.
  • Automatic Code Generation - generates the Python or Scala code to extract, transform, and load data.
  • Integrated Data Catalog - stores metadata from disparate sources in a single repository in the AWS pipeline.

How does AWS Glue manage the ETL service?


[Diagram: AWS Glue]

What are the use cases of AWS Glue?

The use cases of AWS Glue are as follows:

  • Data extraction - extracts data in a variety of formats.
  • Data transformation - reformats data for storage.
  • Data integration - integrates data into enterprise data lakes and warehouses.


What are the drawbacks of AWS Glue?

  • Limited compatibility - works with a variety of commonly used data sources, but mainly with services running on AWS.
  • No incremental data sync - Glue is not the best option for real-time ETL jobs.
  • Learning curve - Glue's Spark-based model can be a hurdle for teams used to querying traditional relational databases.


How can we Automate Data Onboarding?


[Diagram: AWS Glue]


How to list all databases and tables in the AWS Glue Data Catalog?


import boto3

# Glue client for the region that holds the Data Catalog
client = boto3.client('glue', region_name='us-east-1')

# Note: get_databases/get_tables paginate; use NextToken for large catalogs
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)


What is the AWS Glue Data Catalog?

The AWS Glue Data Catalog is a persistent metadata store used for storing structural and operational metadata for all data sets. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and it uses that metadata to query and transform the data. It also tracks how data has changed over time, and it is a drop-in replacement for the Apache Hive Metastore for big data applications running on Amazon EMR. The Data Catalog also provides out-of-the-box integration with Athena, EMR, and Redshift Spectrum.

How can the AWS Glue Data Catalog be accessed with Amazon Athena?


[Diagram: AWS Glue]


What are AWS Glue Crawlers?

AWS Glue crawlers scan data stores, progress through a prioritized list of classifiers to extract the schema of our data and other statistics, and populate the Glue Data Catalog with this metadata. They can run periodically to detect the arrival of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, add new partitions to existing tables, and add new versions of table definitions.
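Crawlers are usually scheduled, but they can also be started on demand through the API. A minimal sketch using boto3, assuming a crawler with the given name already exists (the function and its defaults are illustrative):

```python
def start_catalog_crawler(crawler_name, region="us-east-1"):
    """Start an existing Glue crawler and wait until it finishes.

    `crawler_name` is a placeholder for a crawler already defined in the account.
    """
    import time
    import boto3  # local import; requires AWS credentials at call time

    glue = boto3.client("glue", region_name=region)
    glue.start_crawler(Name=crawler_name)

    # Poll the crawler state; it moves RUNNING -> STOPPING -> READY
    while True:
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            break
        time.sleep(30)
```

`start_crawler` and `get_crawler` are standard boto3 Glue client calls; polling is one simple way to block until the catalog has been updated.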

What is the AWS Glue Schema Registry?

The AWS Glue Schema Registry enables us to validate and control the evolution of streaming data using registered Apache Avro schemas, at no additional charge. The Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.
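Schemas can be registered through the Glue API as well. A hedged sketch using boto3's `create_schema` (the registry and schema names are placeholders, and the registry is assumed to exist):

```python
def register_avro_schema(registry_name, schema_name, avro_definition):
    """Create an Avro schema in the Glue Schema Registry.

    All argument names are illustrative placeholders.
    """
    import boto3  # local import; requires AWS credentials at call time

    glue = boto3.client("glue")
    return glue.create_schema(
        RegistryId={"RegistryName": registry_name},
        SchemaName=schema_name,
        DataFormat="AVRO",
        Compatibility="BACKWARD",          # reject changes that break existing readers
        SchemaDefinition=avro_definition,  # the Avro schema as a JSON string
    )
```

The `Compatibility` mode is what enforces controlled schema evolution: later versions submitted via `register_schema_version` are checked against it.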

How can the Schema Registry be integrated?


[Diagram: AWS Glue]


How can we solve the HIVE_PARTITION_SCHEMA_MISMATCH error?

If we are using a crawler, we should select the following option:
Update all new and existing partitions with metadata from the table
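The same console option can be applied programmatically by updating the crawler's schema-change policy and output configuration with boto3. A sketch, assuming the crawler already exists:

```python
def set_update_all_partitions(crawler_name):
    """Make a crawler propagate table metadata to all partitions, which is
    the usual fix for HIVE_PARTITION_SCHEMA_MISMATCH in Athena.

    `crawler_name` is a placeholder for an existing crawler.
    """
    import boto3  # local import; requires AWS credentials at call time

    glue = boto3.client("glue")
    glue.update_crawler(
        Name=crawler_name,
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # update tables on schema change
            "DeleteBehavior": "LOG",                 # do not drop tables for removed data
        },
        # API equivalent of the console option "Update all new and existing
        # partitions with metadata from the table":
        Configuration='{"Version": 1.0, "CrawlerOutput": '
                      '{"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}}',
    )
```

The `InheritFromTable` setting is what makes existing partitions pick up the table-level schema on the next crawl.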


How to define a nested array to ingest and convert data?

{
    "class_id": "test0001",
    "students": [{
        "student_id": "xxxx",
        "student_name": "AAAABBBCCC",
        "student_gpa": 123
    }]
}
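In a Glue job, the `Relationalize` transform flattens such a nested array into relational rows. The effect can be illustrated with plain Python (stdlib only, no Glue required):

```python
import json

doc = json.loads("""
{
    "class_id": "test0001",
    "students": [{
        "student_id": "xxxx",
        "student_name": "AAAABBBCCC",
        "student_gpa": 123
    }]
}
""")

# One row per element of the nested "students" array, with the parent
# key repeated on every row - the shape Relationalize produces.
rows = [{"class_id": doc["class_id"], **student} for student in doc["students"]]

print(rows[0]["student_name"])  # AAAABBBCCC
```

In an actual job the equivalent is `Relationalize.apply(frame=dyf, staging_path=..., name="root")`, which returns one flattened frame per nested collection.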


How to execute AWS Glue scripts using Python 2.7 from a local machine?


# print_function makes these prints valid on both Python 2.7 and Python 3
from __future__ import print_function

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Running locally requires the aws-glue-libs (or the AWS Glue Docker image)
glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
              database="records",
              table_name="recordsrecords_converted_json")
print("Count: ", persons.count())
persons.printSchema()


What is AWS Glue Streaming ETL?

AWS Glue enables ETL operations on streaming data through continuously running jobs. Streaming ETL is built on the Apache Spark Structured Streaming engine and can ingest streams from Amazon Kinesis Data Streams and Apache Kafka, including Amazon Managed Streaming for Apache Kafka. It can clean and transform streaming data, load it into S3 or JDBC data stores, and process event data such as IoT streams, clickstreams, and network logs.
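A streaming job typically processes micro-batches with `glueContext.forEachBatch`. A minimal sketch, assuming a Glue streaming job environment (all names and S3 paths are hypothetical placeholders):

```python
def run_streaming_etl(glue_context,
                      database="stream_db",
                      table_name="kinesis_events",
                      output_path="s3://example-bucket/out/",
                      checkpoint="s3://example-bucket/checkpoint/"):
    """Read micro-batches from a catalog table backed by a Kinesis/Kafka
    stream and write each batch to S3. All names are placeholders.
    """
    # Local import: awsglue is only available inside a Glue job environment
    from awsglue.dynamicframe import DynamicFrame

    data_frame = glue_context.create_data_frame.from_catalog(
        database=database, table_name=table_name)

    def process_batch(batch_df, batch_id):
        # Skip empty micro-batches
        if batch_df.count() > 0:
            dyf = DynamicFrame.fromDF(batch_df, glue_context, "batch")
            glue_context.write_dynamic_frame.from_options(
                frame=dyf, connection_type="s3",
                connection_options={"path": output_path},
                format="json")

    # forEachBatch runs the callback for every micro-batch of the stream
    glue_context.forEachBatch(frame=data_frame,
                              batch_function=process_batch,
                              options={"windowSize": "60 seconds",
                                       "checkpointLocation": checkpoint})
```

`windowSize` controls how often a micro-batch is emitted, and the checkpoint location lets a restarted job resume where it left off.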

How to set a name for a crawled table?

import boto3

database_name = "database"
table_name = "prefix-dir_name"
new_table_name = "more_awesome_name"

client = boto3.client("glue")
response = client.get_table(DatabaseName=database_name, Name=table_name)
table_input = response["Table"]
table_input["Name"] = new_table_name
# Remove read-only keys that create_table does not accept
for key in ("CreatedBy", "CreateTime", "UpdateTime", "DatabaseName",
            "CatalogId", "VersionId", "IsRegisteredWithLakeFormation"):
    table_input.pop(key, None)
client.create_table(DatabaseName=database_name, TableInput=table_input)


How to specify join types in AWS Glue?

cUser0 = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_users", transformation_ctx = "cUser")

cUser0DF = cUser0.toDF()

cKKR = glueContext.create_dynamic_frame.from_catalog(database = "captains", table_name = "cp_txn_winds_karyakarta_karyakartas", redshift_tmp_dir = args["TempDir"], transformation_ctx = "cKKR")

cKKRDF = cKKR.toDF()

dataSource0 = cUser0DF.join(cKKRDF, cUser0DF.id == cKKRDF.user_id, how='left_outer')
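Glue's built-in Join transform only supports an equijoin, which is why the snippet converts to Spark DataFrames first. A generalized sketch (the helper name and its defaults are illustrative):

```python
def join_dynamic_frames(glue_context, left_dyf, right_dyf,
                        left_key, right_key, how="left_outer"):
    """Join two DynamicFrames via Spark DataFrames with an explicit join type.

    Valid `how` values in Spark include: "inner", "left_outer", "right_outer",
    "full_outer", "left_semi", and "left_anti".
    """
    # Local import: awsglue is only available inside a Glue job environment
    from awsglue.dynamicframe import DynamicFrame

    left_df = left_dyf.toDF()
    right_df = right_dyf.toDF()
    joined = left_df.join(right_df, left_df[left_key] == right_df[right_key], how=how)

    # Convert back so Glue transforms and sinks can be applied downstream
    return DynamicFrame.fromDF(joined, glue_context, "joined")
```

Passing the join type through `how` mirrors the DataFrame API, so the same helper covers outer, semi, and anti joins.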