ZooKeeper ships with a command-line client for interactive use. The ZooKeeper command-line interface is similar to the UNIX file system and shell. Data in ZooKeeper is stored in a hierarchy of znodes, where each znode can contain data, much like a file. Each znode can also have children, just like directories in the UNIX file system.
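As a hedged illustration, a short zkCli.sh session might look like the following; the znode paths and data are assumptions, not taken from the original text:

```shell
# Hypothetical ZooKeeper CLI session: create a parent znode, a child znode
# holding data, then list and read them back.
zkCli.sh -server localhost:2181 <<'EOF'
create /app "cfg"
create /app/db "host=prod-db"
ls /app
get /app/db
EOF
```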
Posted Date:- 2021-08-31 06:43:19
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. Kafka uses ZooKeeper to store various configurations and share them across the Kafka cluster in a distributed manner. To achieve this, configurations are distributed and replicated across the leader and follower nodes of the ZooKeeper ensemble. We cannot connect directly to Kafka by bypassing ZooKeeper, because if ZooKeeper is down Kafka cannot serve client requests.
Apache Flume has a plug-in-based architecture, which is why most data analysts use it: it can load data from external sources and transfer it to external destinations.
The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.
The prime differences between the two are as follows:
1. Pig is a procedural query language, while SQL is declarative.
2. Pig follows a nested relational data model, while SQL follows a flat one.
3. Having schema in Pig is optional, but in SQL, it is mandatory.
4. Pig offers limited query optimization, but SQL offers significant optimization.
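To illustrate the procedural vs. declarative difference, here is a hedged sketch; the file, field, and alias names are assumptions:

```shell
# Pig expresses the computation as a named sequence of steps.
# Requires pig on the PATH and a local users.csv; run in local mode.
pig -x local <<'EOF'
users   = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
adults  = FILTER users BY age >= 18;
grouped = GROUP adults BY age;
counts  = FOREACH grouped GENERATE group AS age, COUNT(adults) AS n;
DUMP counts;
EOF
# The equivalent declarative SQL is a single statement:
#   SELECT age, COUNT(*) FROM users WHERE age >= 18 GROUP BY age;
```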
It is an interface that allows interaction with HDFS. It is also sometimes referred to as the Pig interactive shell (the Grunt shell).
"Oozie" is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs such as "Java MapReduce", "Streaming MapReduce", "Pig", "Hive", and "Sqoop".
RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed, which is a key component of Apache Spark.
The components of a Region Server are:
1. WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores new data that hasn't been persisted or committed to permanent storage.
2. Block Cache: The Block Cache resides at the top of the Region Server. It stores the frequently read data in memory.
3. MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
4. HFile: HFile is stored in HDFS. It stores the actual cells on the disk.
The "Derby database" is the default "Hive Metastore". Multiple users (processes) cannot access it at the same time. It is mainly used to perform unit tests.
If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to bring in that functionality using other languages like Java, Python, Ruby, etc., and embed them in the script file.
Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.
Atomic data types: Atomic or scalar data types are the basic data types which are used in all the languages like string, int, float, long, double, char[], byte[].
Complex Data Types: Complex data types are Tuple, Map and Bag.
"SequenceFileInputFormat" is an input format for reading from sequence files. It is a specific compressed binary file format optimized for passing data from the output of one "MapReduce" job to the input of another "MapReduce" job.
Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
A "Combiner" is a mini "reducer" that performs the local "reduce" task. It receives the input from the "mapper" on a particular "node" and sends the output to the "reducer". "Combiners" help in enhancing the efficiency of "MapReduce" by reducing the quantum of data that is required to be sent to the "reducers".
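As a hedged sketch using Hadoop Streaming (the jar path, scripts, and data paths are assumptions), the reducer program can be reused as the combiner when its operation is associative and commutative, such as summing counts:

```shell
# Word count where the summing reducer also runs locally on each mapper
# node as a combiner, shrinking the data shuffled to the reducers.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input    /data/text \
  -output   /data/wordcount \
  -mapper   mapper.py \
  -combiner reducer.py \
  -reducer  reducer.py \
  -file     mapper.py -file reducer.py
```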
Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.
There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments, such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir, and --export-dir, are not supported.
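A hedged sketch of such an import; the JDBC URL, credentials, and table names are assumptions:

```shell
# Hypothetical import of an RDBMS table straight into an HCatalog table.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user -P \
  --table orders \
  --hcatalog-database analytics \
  --hcatalog-table orders
# Combining --hcatalog-* with --target-dir, --export-dir,
# --as-sequencefile, --as-avrodatafile, or --direct is not supported.
```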
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
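A hedged sketch of a free-form query import (connection details and column names are assumptions). Note that Sqoop requires the literal $CONDITIONS placeholder in the WHERE clause and a --split-by column (or -m 1) so it can partition the query across mappers:

```shell
# Hypothetical free-form SQL import partitioned on o.id.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoop_user -P \
  --query 'SELECT o.id, o.total FROM orders o WHERE $CONDITIONS' \
  --split-by o.id \
  --target-dir /user/hadoop/orders
```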
Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store:
1) CLOBs – Character Large Objects
2) BLOBs – Binary Large Objects
Large objects in Sqoop are handled by importing them into a file referred to as a "LobFile", i.e. Large Object File. The LobFile can store records of huge size; each record in the LobFile is a large object.
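A hedged sketch of importing a table with LOB columns (connection details, table name, and the size threshold are assumptions). LOB values below the inline limit, in bytes, are kept in the main record files; larger values are spilled into LobFiles:

```shell
# Hypothetical import where LOBs over 16 MB are stored in separate LobFiles.
sqoop import \
  --connect jdbc:mysql://dbhost/media \
  --username sqoop_user -P \
  --table documents \
  --inline-lob-limit 16777216 \
  --target-dir /user/hadoop/documents
```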
/usr/bin/Hadoop Sqoop
Yes, Sqoop supports two types of incremental imports-
1)Append
2)Last Modified
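A hedged sketch of both modes (connection details, column names, and values are assumptions). In Sqoop's syntax the second mode is spelled lastmodified:

```shell
# Append mode: fetch only rows whose id exceeds the last imported value.
sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental append --check-column id --last-value 1000

# Lastmodified mode: fetch rows updated since the given timestamp and
# merge them into the existing data on the key column.
sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
  --incremental lastmodified --check-column updated_at \
  --last-value "2021-08-30 00:00:00" --merge-key id
```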
Distributed Cache is a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework will make it available on each data node where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer.
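A hedged sketch of caching a file from the command line (the jar, main class, and paths are assumptions; this relies on the job using Hadoop's generic options via Tool/GenericOptionsParser):

```shell
# The generic -files option ships lookup.txt to every node running tasks
# for this job; a task can then open "lookup.txt" as an ordinary local file.
hadoop jar analytics.jar com.example.JoinJob \
  -files hdfs:///user/hadoop/lookup.txt \
  /data/input /data/output
```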
The "InputSplit" defines a slice of work but does not describe how to access it. The "RecordReader" class loads the data from its source and converts it into (key, value) pairs suitable for reading by the "Mapper" task. The "RecordReader" instance is defined by the "InputFormat".
The primary components of ZooKeeper architecture are:
Node: The systems (servers) that make up the cluster.
ZNode: A node that stores data along with version information about updates.
Client applications: These are the client-side applications useful in interacting with distributed applications.
Server applications: These are the server-side applications that provide an interface for the client applications to interact with the server.
Interceptors are useful in filtering out unwanted log files. We can use them to eliminate events between the source and channel or the channel and sink based on our requirements.
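A hedged sketch of a filtering interceptor in a Flume agent configuration; the agent, source, and pattern names are assumptions:

```properties
# Drop every event whose body matches DEBUG before it reaches the channel.
agent.sources.r1.interceptors = i1
agent.sources.r1.interceptors.i1.type = regex_filter
agent.sources.r1.interceptors.i1.regex = DEBUG
agent.sources.r1.interceptors.i1.excludeEvents = true
```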
The Sqoop metastore is a shared repository where multiple local and remote users can execute saved jobs. We can connect to the Sqoop metastore through sqoop-site.xml or with the --meta-connect argument.
We can use JDBC-based imports for BLOB and CLOB columns, as Sqoop's direct import mode does not support them.
1. Heterogeneity: The design of applications should allow users to access services and run applications over a heterogeneous collection of computers and networks, taking into consideration hardware devices, operating systems, networks, and programming languages.
2. Transparency: Distributed-system designers must hide the complexity of the system as much as they can. Some forms of transparency are location, access, migration, relocation, and so on.
3. Openness: It is a characteristic that determines whether the system can be extended and reimplemented in various ways.
4. Security: Distributed system Designers must take care of confidentiality, integrity, and availability.
5. Scalability: A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance.
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a very stable, distributed, and configurable tool. It is generally designed to copy streaming data (log data) from various web servers to HDFS.
This question can have two answers; we will discuss both. We can restart the NameNode by the following methods:
1. Stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it using the ./sbin/hadoop-daemon.sh start namenode command.
2. To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which stops all the daemons first and then starts them all.
If a node appears to be executing a task slower, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted and the other one is killed. This process is called "speculative execution".
Rack Awareness is the algorithm by which the "NameNode" decides how blocks and their replicas are placed, based on rack definitions, to minimize network traffic between "DataNodes" within the same rack. Let's say we consider replication factor 3 (the default): the policy is that "for every block of data, two copies will exist in one rack and the third copy in a different rack". This rule is known as the "Replica Placement Policy".
The jps command helps us check whether the Hadoop daemons are running. It shows all the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.
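On a node running HDFS and YARN daemons, the output might look like the following; the process IDs are illustrative and will differ on your machine:

```shell
jps
# 2401 NameNode
# 2563 DataNode
# 2790 ResourceManager
# 2931 NodeManager
# 3120 Jps
```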
HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across multiple files. The NameNode stores the metadata of the file system in RAM, so the amount of memory imposes a limit on the number of files in an HDFS file system. In other words, too many files generate too much metadata, and storing all that metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block, or directory takes about 150 bytes.
The smart answer to this question would be that DataNodes are commodity hardware like personal computers and laptops, as they store data and are required in large numbers. But from your experience, you can add that the NameNode is the master node and stores metadata about all the blocks in HDFS. It requires a lot of memory (RAM), so the NameNode needs to be a high-end machine with good memory capacity.
When data is stored on HDFS, the NameNode replicates the data to several DataNodes. The default replication factor is 3, and you can change it as per your need. If a DataNode goes down, the NameNode automatically copies the data to another node from the replicas, keeping the data available. This provides fault tolerance in HDFS.
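A hedged sketch of inspecting and changing the replication factor of an existing file from the command line (the path is an assumption); the cluster-wide default comes from the dfs.replication property:

```shell
hdfs dfs -stat %r /data/events.log      # print the file's replication factor
hdfs dfs -setrep -w 2 /data/events.log  # re-replicate to 2 copies and wait
```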
The applications supported by Apache Hive are:
Java
PHP
Python
C++
Ruby
On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
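A hedged hbase shell sketch of this behaviour; the table, row, and column names are assumptions:

```shell
# The delete writes a tombstone marker: reads stop returning the cell at
# once, but the bytes are only physically removed at a major compaction.
echo "delete 't1', 'row1', 'cf:col'
major_compact 't1'" | hbase shell
```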
The logical division of data is represented through a key known as the column family. Column families are the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size, whereas new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size as well, so that the existing data is read correctly.
The two important catalog tables in HBase are -ROOT- and .META. The -ROOT- table tracks where the .META. table is, and the .META. table stores all the regions in the system.
1. Family delete marker: It marks all the columns for deletion from a column family.
2. Version delete marker: It marks a single version of a column for deletion.
3. Column delete marker: It marks all versions of a column for deletion.
In brief, "Checkpointing" is a process that takes an FsImage and edit log and compacts them into a new FsImage. Thus, instead of replaying the edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.
Apache Spark stores data in memory for faster processing. This benefits the development of machine learning models, which may run many algorithms over multiple iterations and conceptual steps to build an optimized model. Graph algorithms likewise traverse all the nodes and edges of a graph. These low-latency workloads, which need many iterations, see enhanced performance.
The primary parameters of a mapper are LongWritable and Text, which represent the input key and value types, and Text and IntWritable, which represent the intermediate output key and value types.
Distributed Cache is a significant feature provided by the MapReduce Framework, practiced when you want to share the files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files.
Hadoop's MapReduce framework can cache small to moderately sized read-only files such as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes) where MapReduce jobs are running. Every DataNode gets a local copy of the file, sent via the Distributed Cache.
RDBMS is a schema-based database, whereas HBase has a schema-less data model.
RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.
RDBMS stores normalized data whereas HBase stores de-normalized data.
Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array.
Record-level operational commands in HBase are: put, get, increment, scan, and delete.
Table-level operational commands in HBase are: describe, list, drop, disable, and enable.
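A hedged hbase shell sketch mixing both kinds of command; the table, row, and column-family names are assumptions:

```shell
# Record-level: put/get/scan; table-level: describe/disable/drop.
echo "put 't1', 'row1', 'cf:a', 'v1'
get 't1', 'row1'
scan 't1'
describe 't1'
disable 't1'
drop 't1'" | hbase shell
```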
The client application submits the job to the JobTracker.
The JobTracker contacts the NameNode to determine the data location.
Using the available slots and proximity to the data, the JobTracker locates TaskTracker nodes.
It submits the work to the selected TaskTracker nodes.
When a task fails, the JobTracker is notified and decides the next steps.
The JobTracker monitors the TaskTracker nodes.
The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.
A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message, it is marked dead after a specific period of time.
The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.
HDFS supports exclusive writes only.
When the first client contacts the "NameNode" to open the file for writing, the "NameNode" grants a lease to the client to create this file. When the second client tries to open the same file for writing, the "NameNode" will notice that the lease for the file is already granted to another client and will reject the open request for the second client.