Top AWS Athena(2021) Interview Questions | CodeUsingJava

Top AWS Athena Interview Questions


  1. What is AWS Athena?
  2. How can we create first query using Athena?
  3. How can we Query any data source with Amazon Athena?
  4. What are the Features Of Athena?
  5. How can we create Athena database and table?
  6. How can we Automate Data Onboarding?
  7. How to get input file name as column in AWS Athena external tables?
  8. How can we convert CSV files to Parquet using AWS Athena?
  9. How can we monitor and manage Amazon QuickSight and Athena in our CI/CD pipeline?
  10. How to tune your Performance of Athena?
  11. Why is Athena most suitable for Data Analytics?
  12. What are the various file formats in which data is stored in?
  13. How can we enter crawler name?
  14. How to Create Dataframe from AWS Athena using Boto3 get query results method?
  15. How to create Athena database via API?

What is AWS Athena?

AWS Athena is an interactive, serverless query service for analyzing data stored in Amazon S3 using standard SQL. The user points Athena at data in S3, defines a schema, and runs standard SQL queries to get results. Typical tasks include database automation, table creation, Parquet file conversion, Snappy compression, and partitioning. Athena scales automatically and executes queries in parallel, so it returns results quickly even with large datasets and complex queries.

How can we create first query using Athena?

select distinct costdb.cost.productname
from costdb.


How can we Query any data source with Amazon Athena?




What are the Features Of Athena?

Features of AWS Athena are:
  • Easy Implementation - Athena needs no setup and can be accessed directly from the AWS Console or through the AWS CLI.
  • Serverless - There is no infrastructure to provision or manage; AWS takes care of the servers and scaling.
  • Pay per query - We pay only for the queries we run, based on the amount of data each query scans, so compressing the data set reduces cost.
  • Fast - Athena runs complex queries quickly by breaking them into simpler parts and executing those parts in parallel.
  • Secure - Data stays in S3 buckets, and IAM policies control which users can access it.
  • Highly available - Athena is highly available and can execute queries around the clock.
  • Integration - Athena integrates with AWS Glue, which helps in creating better versioning of data, better tables, views, etc.

How can we create Athena database and table?

Limitations to keep in mind when working with Athena:
  • Limited compatibility - Athena works with a variety of commonly used data formats, but it mainly works with data and services running on AWS.
  • No incremental data sync - Athena is not the best option for real-time ETL jobs.
  • Learning curve - Athena supports standard SQL queries, but it does not behave like a traditional relational database, so it takes some time to learn.


How can we Automate Data Onboarding?

-- Athena runs one statement per query, so submit these two statements separately.
create database if not exists costdb;

-- All columns are read as strings because OpenCSVSerde parses every field as text.
create external table if not exists costdb.cost (
    InvoiceID string,
    PayerAccountId string,
    LinkedAccountId string,
    RecordType string,
    RecordId string,
    ProductName string,
    RateId string,
    SubscriptionId string,
    PricingPlanId string,
    UsageType string,
    Operation string,
    AvailabilityZone string,
    ReservedInstance string,
    ItemDescription string,
    UsageStartDate string,
    UsageEndDate string,
    UsageQuantity string,
    Rate string,
    Cost string,
    ResourceId string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    'separatorChar' = ',',
    'quoteChar' = '"',
    'escapeChar' = '\\'
)
stored as textfile
location 's3://technology-aws-billing-data/athena/'


How to get input file name as column in AWS Athena external tables?

We can do this with the "$path" pseudo-column, which returns the S3 location of the file each row was read from:
select "$path" from table
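
A minimal sketch of how the "$path" value might be used from Python. The helper names rows_per_file_query and file_name_from_path are hypothetical, and the table name costdb.cost and the sample file name are assumptions for illustration only. The first helper builds a query that counts rows per source file; the second extracts the bare file name from the S3 URI that "$path" returns.

```python
def rows_per_file_query(table: str) -> str:
    """Build a query that counts rows per source file using "$path"."""
    # "$path" has to be double-quoted because of the leading dollar sign.
    # GROUP BY 1 groups on the first selected expression (the file path).
    return (
        f'SELECT "$path" AS source_file, count(*) AS row_count '
        f'FROM {table} GROUP BY 1'
    )


def file_name_from_path(s3_uri: str) -> str:
    """Return the last path component of an S3 URI such as s3://bucket/a/b.csv."""
    return s3_uri.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    # Hypothetical table and object key, used only to show the output shape.
    print(rows_per_file_query("costdb.cost"))
    print(file_name_from_path("s3://technology-aws-billing-data/athena/part-0001.csv"))
```

The generated SQL string can be submitted like any other Athena query; the file name helper works on the values returned in the source_file column.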


How can we convert CSV files to Parquet using AWS Athena?

# This script runs as an AWS Glue ETL job (the awsglue library is only
# available inside Glue). Athena can then query the Parquet output.
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# The Glue Data Catalog client lists the CSV tables to convert
client = boto3.client('glue', region_name='ap-southeast-2')

databaseName = 'tpc-ds-csv'
print('\ndatabaseName: ' + databaseName)

# get_tables returns one page of tables; follow NextToken if there are more
Tables = client.get_tables(DatabaseName=databaseName)

tableList = Tables['TableList']

for table in tableList:
    tableName = table['Name']
    print('\n-- tableName: ' + tableName)

    # Read the CSV-backed table through the Data Catalog
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database=databaseName,
        table_name=tableName,
        transformation_ctx="datasource0"
    )

    # Write the same data back to S3 as Parquet, one folder per table
    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3",
        connection_options={
            "path": "s3://aws-glue-tpcds-parquet/" + tableName + "/"
        },
        format="parquet",
        transformation_ctx="datasink4"
    )
job.commit()
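
The conversion can also be done inside Athena itself with a CREATE TABLE AS SELECT (CTAS) statement. The following is a minimal sketch, not a definitive implementation: the helper name build_ctas_to_parquet, the table names, and the bucket are all hypothetical. It only builds the SQL string, which would then be submitted with start_query_execution.

```python
def build_ctas_to_parquet(source_table: str, target_table: str,
                          target_location: str) -> str:
    """Return a CTAS statement that rewrites source_table as Parquet files."""
    if not target_location.startswith("s3://"):
        raise ValueError("target_location must be an s3:// URI")
    # Normalise to a trailing slash so the location is treated as a folder.
    location = target_location.rstrip("/") + "/"
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', external_location = '{location}') "
        f"AS SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_ctas_to_parquet(
        "costdb.cost", "costdb.cost_parquet", "s3://example-bucket/cost-parquet"
    ))
```

Athena expects the external_location folder to be empty when the CTAS query runs, so a fresh prefix per conversion is the safer choice.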


How can we monitor and manage Amazon QuickSight and Athena in our CI/CD pipeline?




How to tune your Performance of Athena?



  • Data Partitioning - divides the table into parts and keeps related data together based on column values such as date, country, or region, so queries that filter on those columns scan less data.
  • Bucketing Data - distributes the data within a single partition into a fixed number of files (buckets) based on the value of a column.
  • Compress the Files - compression reduces the amount of data scanned and so increases query speed, provided the files are of optimal size or are splittable.
  • Optimization of File Sizes - queries run more efficiently when files are of optimal size; splittable file formats help with parallelism irrespective of the size of the files.
  • Optimization of Columnar Data Store Generation - columnar formats such as Apache Parquet and Apache ORC store data efficiently by using column-wise compression, different encodings, and compression based on data type.
  • Optimization of ORDER BY - ORDER BY returns the results of a query in sorted order; because sorting is expensive, combine it with LIMIT where possible.
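
To make the partitioning tip concrete, here is a minimal sketch of a Hive-style key=value folder layout in S3, which Athena can map to partition columns. The helper name partition_prefix and the bucket name are hypothetical.

```python
from typing import Mapping


def partition_prefix(base: str, partitions: Mapping[str, str]) -> str:
    """Build a Hive-style S3 prefix, e.g. s3://bucket/logs/year=2021/month=01/."""
    # Each partition column becomes one key=value folder, in the given order.
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([base.rstrip("/")] + parts) + "/"


if __name__ == "__main__":
    # Hypothetical bucket and partition columns, for illustration only.
    print(partition_prefix("s3://example-bucket/logs/", {"year": "2021", "month": "01"}))
```

With data laid out this way and the table declared as partitioned by year and month, a query that filters on those columns only needs to read the matching prefixes.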

Why is Athena most suitable for Data Analytics?




What are the various file formats in which data is stored in?

Athena can query data stored in the following file formats:
  • CSV, raw logs, text files
  • Apache weblogs
  • JSON
  • Compressed files
  • Columnar formats such as Apache Parquet and Apache ORC


How can we enter crawler name?

The cars.json file is stored at the S3 location s3://rosyll-niranjana-xavier/data_input/json-files/cars.json. When setting up the crawler, you can point it at that file, or you can choose the folder s3://rosyll-niranjana-xavier/data_input/json-files/ as the path.


How to Create Dataframe from AWS Athena using Boto3 get query results method?

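import boto3

client = boto3.client('athena')
# res is the response returned by an earlier client.start_query_execution(...) call;
# the query must have finished before its results can be fetched
response = client.get_query_results(
    QueryExecutionId=res['QueryExecutionId']
)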
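
The response is a nested dictionary, so it has to be flattened before it can become a DataFrame. Below is a minimal sketch, assuming the documented ResultSet layout (ResultSetMetadata.ColumnInfo for column names, Rows[].Data[].VarCharValue for values). The helper name result_rows is hypothetical.

```python
from typing import Any, Dict, List, Optional


def result_rows(response: Dict[str, Any]) -> List[Dict[str, Optional[str]]]:
    """Flatten one get_query_results page into a list of column->value dicts."""
    result_set = response["ResultSet"]
    columns = [c["Name"] for c in result_set["ResultSetMetadata"]["ColumnInfo"]]
    rows = []
    for row in result_set["Rows"]:
        # A NULL cell comes back as an empty dict, so .get() yields None.
        values = [cell.get("VarCharValue") for cell in row["Data"]]
        rows.append(dict(zip(columns, values)))
    # For SELECT queries the first row of the first page repeats the column names.
    if rows and list(rows[0].values()) == columns:
        rows = rows[1:]
    return rows
```

Assuming pandas is installed, pandas.DataFrame(result_rows(response)) then gives the DataFrame. All values arrive as strings, and a single call returns only one page of results, so larger result sets need the get_query_results paginator.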


How to create Athena database via API?

import boto3

client = boto3.client('athena')

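# Athena writes the query results and metadata to this S3 location
config = {'OutputLocation': 's3://TEST_BUCKET/'}

# The call returns immediately with a QueryExecutionId; the statement runs asynchronously
client.start_query_execution(
    QueryString='create database TEST_DATABASE',
    ResultConfiguration=config
)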
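
Because start_query_execution only submits the statement, the caller usually has to poll until it finishes. The following is a minimal sketch of such a loop using get_query_execution. The helper name wait_for_query and its poll_seconds and max_polls parameters are hypothetical, and the client is passed in so the function works with any boto3 Athena client.

```python
import time
from typing import Any

# States after which Athena will not change the query status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client: Any, query_execution_id: str,
                   poll_seconds: float = 1.0, max_polls: int = 60) -> str:
    """Poll get_query_execution until the query reaches a terminal state."""
    for _ in range(max_polls):
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_execution_id} did not finish in time")
```

A typical use would be to keep the response of start_query_execution, pass its QueryExecutionId to wait_for_query, and continue only if the returned state is SUCCEEDED.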