
[Apr-2023] Associate-Developer-Apache-Spark Questions - Truly Beneficial For Your Databricks Exam
NEW QUESTION # 92
Which of the following code blocks generally causes a great amount of network traffic?
- A. DataFrame.rdd.map()
- B. DataFrame.select()
- C. DataFrame.count()
- D. DataFrame.collect()
- E. DataFrame.coalesce()
Answer: D
Explanation:
DataFrame.collect() sends all data in a DataFrame from executors to the driver, so this generally causes a great amount of network traffic in comparison to the other options listed.
DataFrame.coalesce() just reduces the number of partitions and generally aims to reduce network traffic in comparison to a full shuffle.
DataFrame.select() is evaluated lazily and, unless followed by an action, does not cause significant network traffic.
DataFrame.rdd.map() is evaluated lazily and therefore does not cause great amounts of network traffic.
DataFrame.count() is an action. While it does cause some network traffic, for the same DataFrame, collecting all data in the driver would generally be considered to cause a greater amount of network traffic.
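As a rough illustration of the difference between count() and collect(), the following plain-Python sketch models partitions as lists of rows and tallies how many values would have to travel to the driver. It does not use Spark. The partition layout, the row values, and the function names are hypothetical and only mirror the reasoning above.

```python
# Plain-Python model (no Spark): each inner list stands for the rows held by one partition.
partitions = [
    [("tx1", 3), ("tx2", 6)],
    [("tx3", 1), ("tx4", 0), ("tx5", 9)],
    [("tx6", 4)],
]


def values_sent_by_count(parts):
    """Model of count(): every partition reports a single partial count to the driver."""
    partial_counts = [len(p) for p in parts]
    return len(partial_counts), sum(partial_counts)


def values_sent_by_collect(parts):
    """Model of collect(): every row of every partition is shipped to the driver."""
    rows_on_driver = [row for p in parts for row in p]
    return len(rows_on_driver), rows_on_driver


count_messages, total = values_sent_by_count(partitions)
collect_messages, rows = values_sent_by_collect(partitions)
print("count() sends", count_messages, "values; result:", total)
print("collect() sends", collect_messages, "rows")
```

In this model the traffic caused by count() grows with the number of partitions, while the traffic caused by collect() grows with the number of rows.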
NEW QUESTION # 93
Which of the following DataFrame operators is never classified as a wide transformation?
- A. DataFrame.aggregate()
- B. DataFrame.select()
- C. DataFrame.sort()
- D. DataFrame.repartition()
- E. DataFrame.join()
Answer: B
Explanation:
As a general rule: After having gone through the practice tests you probably have a good feeling for what classifies as a wide and what classifies as a narrow transformation. If you are unsure, feel free to play around in Spark and display the explanation of the Spark execution plan via DataFrame.[operation, for example sort()].explain(). If repartitioning is involved, it would count as a wide transformation.
DataFrame.select()
Correct! A wide transformation includes a shuffle, meaning that data from one input partition can end up in several output partitions. This is expensive and causes traffic across the cluster. With the select() operation, however, you tell Spark to perform an operation on a specific slice of each partition. For this, Spark does not need to exchange data across partitions; each partition can be worked on independently. Thus, you do not cause a wide transformation.
DataFrame.repartition()
Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is known as a shuffle and, in turn, is classified as a wide transformation.
DataFrame.aggregate()
No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on one or more input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation.
DataFrame.join()
Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join() counts as a wide transformation.
DataFrame.sort()
False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as wide transformation.
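The distinction between narrow and wide transformations can be sketched without Spark. The plain-Python model below treats a DataFrame as a list of partitions and records which input partitions each output partition depends on. The data, the function names, and the `ord(key) % num_output` routing rule are hypothetical stand-ins, not Spark's actual partitioner.

```python
# Plain-Python model (no Spark): a DataFrame is a list of partitions,
# each partition a list of (key, value) rows.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def narrow_select_values(parts):
    """Like select(): each output partition is built from exactly one input partition."""
    return [[value for _key, value in part] for part in parts]


def wide_repartition_by_key(parts, num_output):
    """Like a shuffle: rows are routed by key, so outputs may draw on many inputs."""
    outputs = [[] for _ in range(num_output)]
    sources = [set() for _ in range(num_output)]
    for in_idx, part in enumerate(parts):
        for key, value in part:
            out_idx = ord(key) % num_output  # deterministic stand-in for a hash partitioner
            outputs[out_idx].append((key, value))
            sources[out_idx].add(in_idx)
    return outputs, sources


selected = narrow_select_values(input_partitions)
shuffled, shuffle_sources = wide_repartition_by_key(input_partitions, 2)
print(selected)
print(shuffle_sources)
```

In the narrow case, partition boundaries are untouched. In the wide case, every output partition draws rows from more than one input partition, which is what forces data to move across the cluster.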
More info: Understanding Apache Spark Shuffle | Philipp Brunenberg
NEW QUESTION # 94
Which of the following statements about executors is correct?
- A. Executors are launched by the driver.
- B. Executors store data in memory only.
- C. Each node hosts a single executor.
- D. Executors stop upon application completion by default.
- E. An executor can serve multiple applications.
Answer: D
Explanation:
Executors stop upon application completion by default.
Correct. Executors only persist during the lifetime of an application.
A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, independent of whether the application has been completed or not.
An executor can serve multiple applications.
Wrong. An executor is always specific to one application. It is terminated when the application completes (for the exception, see above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation, How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT, and Spark Jargon for Starters. This blog is to clear some of the... | by Mageswaran D | Medium
NEW QUESTION # 95
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
- A. 1. filter
2. col("supplier").contains("Sports")
3. "itemName"
4. explode("attributes")
- B. 1. where
2. col("supplier").contains("Sports")
3. "itemName"
4. "attributes"
- C. 1. where
2. col(supplier).contains("Sports")
3. explode(attributes)
4. itemName
- D. 1. where
2. "Sports".isin(col("Supplier"))
3. "itemName"
4. array_explode("attributes")
- E. 1. filter
2. col("supplier").isin("Sports")
3. "itemName"
4. explode(col("attributes"))
Answer: A
Explanation:
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve the gap questions and filter out wrong answers. You do not always have to start with the first gap, but can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.
We would use the isin operator if we wanted to filter for supplier names that match any entry in a list of supplier names.
Finally, we are left with two answers that fill the third gap both with "itemName" and the fourth gap either with explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is what the explode() operator does.
One answer option also includes array_explode() which is not a valid operator in PySpark.
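To make the difference between contains, isin, and explode more tangible, here is a plain-Python sketch that reproduces the expected output from the sample rows of the question. It does not call Spark. The list of dictionaries and the variable names are hypothetical stand-ins for itemsDf.

```python
# Plain-Python model (no Spark) of filter + explode on the sample rows from the question.
items = [
    {"itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"], "supplier": "Sports Company Inc."},
    {"itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["red", "summer", "fresh", "cooling"], "supplier": "YetiX"},
    {"itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"], "supplier": "Sports Company Inc."},
]

# contains("Sports") corresponds to a substring test;
# explode corresponds to emitting one output row per array element.
exploded = [
    (item["itemName"], attribute)
    for item in items
    if "Sports" in item["supplier"]
    for attribute in item["attributes"]
]

# An isin-style test checks for an exact match against a list of values instead.
exact_matches = [item["itemName"] for item in items if item["supplier"] in ["Sports"]]

for row in exploded:
    print(row)
print("exact matches:", exact_matches)
```

The exact-match variant finds no rows, because no supplier is named exactly Sports. This is why the isin option does not satisfy the question.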
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 96
Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?
- A. itemsDf.persist(StorageLevel.MEMORY_ONLY)
- B. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
- C. itemsDf.cache()
- D. itemsDf.store()
- E. itemsDf.write.option('destination', 'memory').save()
Answer: C
Explanation:
The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values to memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, however not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not have any arguments.
If you have trouble finding the storage level information in the documentation, please also see this student Q&A thread, which sheds some light on it.
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 97
The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf. Find the error.
Code block:
itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)
- A. The join expression is malformed.
- B. The join method is inappropriate.
- C. The merge method should be used instead of join.
- D. The join statement is incomplete.
- E. The union method should be used instead of join.
Answer: D
Explanation:
Correct code block:
itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.transactionId)
The join statement is incomplete.
Correct! If you look at the documentation of DataFrame.join() (linked below), you see that the very first argument of join should be the DataFrame that should be joined with. This first argument is missing in the code block.
The join method is inappropriate.
No. By default, DataFrame.join() uses an inner join. This method is appropriate for the scenario described in the question.
The join expression is malformed.
Incorrect. The join expression itemsDf.itemId==transactionsDf.transactionId is correct syntax.
The merge method should be used instead of join.
False. There is no DataFrame.merge() method in PySpark.
The union method should be used instead of join.
Wrong. DataFrame.union() merges rows, but not columns as requested in the question.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation, pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 98
The code block shown below should add a column itemNameBetweenSeparators to DataFrame itemsDf. The column should contain arrays of maximum 4 strings. The arrays should be composed of the values in column itemName which are separated at - or whitespace characters. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+
Code block:
itemsDf.__1__(__2__, __3__(__4__, "[\s\-]", __5__))
- A. 1. withColumn
2. "itemNameBetweenSeparators"
3. split
4. "itemName"
5. 5
- B. 1. withColumn
2. itemNameBetweenSeparators
3. str_split
4. "itemName"
5. 5
- C. 1. withColumn
2. "itemNameBetweenSeparators"
3. split
4. "itemName"
5. 4
(Correct)
- D. 1. withColumnRenamed
2. "itemNameBetweenSeparators"
3. split
4. "itemName"
5. 4
- E. 1. withColumnRenamed
2. "itemName"
3. split
4. "itemNameBetweenSeparators"
5. 4
Answer: C
Explanation:
This question deals with the parameters of Spark's split operator for strings.
To solve this question, you first need to understand the difference between DataFrame.withColumn() and DataFrame.withColumnRenamed(). The correct option here is DataFrame.withColumn() since, according to the question, we want to add a column and not rename an existing column. This leaves you with only 3 answers to consider.
The second gap should be filled with the name of the new column to be added to the DataFrame. One of the remaining answers states the column name as itemNameBetweenSeparators, while the other two state it as "itemNameBetweenSeparators". The correct option here is "itemNameBetweenSeparators", since the other option would let Python try to interpret itemNameBetweenSeparators as the name of a variable, which we have not defined. This leaves you with 2 answers to consider.
The decision boils down to how to fill gap 5. Either with 4 or with 5. The question asks for arrays of maximum four strings. The code in gap 5 relates to the limit parameter of Spark's split operator (see documentation linked below). The documentation states that "the resulting array's length will not be more than limit", meaning that we should pick the answer option with 4 as the code in the fifth gap here.
On a side note: One answer option includes a function str_split. This function does not exist in PySpark.
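The effect of the limit parameter can be tried out in plain Python. The sketch below is an analogy only and does not call Spark: Python's re.split counts the number of splits (maxsplit) rather than the number of resulting pieces, so the hypothetical helper split_with_limit translates a limit into maxsplit = limit - 1. It assumes a limit greater than 1.

```python
import re


def split_with_limit(text, pattern, limit):
    """Return at most `limit` pieces; the last piece keeps the unsplit remainder.

    Assumes limit > 1, because maxsplit=0 means "no limit" in re.split.
    """
    return re.split(pattern, text, maxsplit=limit - 1)


item_name = "Thick Coat for Walking in the Snow"
parts_limit_4 = split_with_limit(item_name, r"[\s\-]", 4)
parts_limit_5 = split_with_limit(item_name, r"[\s\-]", 5)
print(parts_limit_4)
print(parts_limit_5)
```

With a limit of 4, the result has four entries and the last one holds the remainder "Walking in the Snow". With a limit of 5, the array grows to five entries, which is more than the question allows.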
More info: pyspark.sql.functions.split - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION # 99
The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.
Code block:
transactionsDf.filter(col('predError').in([3, 6])).count()
- A. Instead of a list, the values need to be passed as single arguments to the in operator.
- B. Numbers 3 and 6 need to be passed as string variables.
- C. The method used on column predError is incorrect.
- D. The number of rows cannot be determined with the count() operator.
- E. Instead of filter, the select method should be used.
Answer: C
Explanation:
Correct code block:
transactionsDf.filter(col('predError').isin([3, 6])).count()
The isin method is the correct one to use here - the in method does not exist for the Column object.
More info: pyspark.sql.Column.isin - PySpark 3.1.2 documentation
NEW QUESTION # 100
The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)
- A. 1. store
2. with
3. "replacement"
4. "compression"
5. path
- B. 1. write
2. mode
3. "overwrite"
4. "compression"
5. save
(Correct)
- C. 1. save
2. mode
3. "replace"
4. "compression"
5. path
- D. 1. write
2. mode
3. "overwrite"
4. compression
5. parquet
- E. 1. save
2. mode
3. "ignore"
4. "compression"
5. path
Answer: B
Explanation:
Correct code block:
transactionsDf.write.format("parquet").mode("overwrite").option("compression", "brotli").save(storeDir)
Solving this question requires you to know how to access the DataFrameWriter (link below) from the DataFrame API - through DataFrame.write.
Another nuance here is about knowing the different modes available for writing parquet files that determine Spark's behavior when dealing with existing files. These, together with the compression options are explained in the DataFrameWriter.parquet documentation linked below.
Finally, bracket __5__ poses a certain challenge. You need to know which command you can use to pass down the file path to the DataFrameWriter. Both save and parquet are valid options here.
More info:
- DataFrame.write: pyspark.sql.DataFrame.write - PySpark 3.1.1 documentation
- DataFrameWriter.parquet: pyspark.sql.DataFrameWriter.parquet - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 101
Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?
- A. 1, 10
- B. 1, 4, 6, 9
- C. 0
- D. 1, 8
- E. 7, 9, 10
Answer: D
Explanation:
1: Correct - This should just read "API" or "DataFrame API". The DataFrame is not part of the SQL API. To make a DataFrame accessible via SQL, you first need to create a DataFrame view. That view can then be accessed via SQL.
4: Although "K_38_INU" looks odd, it is a completely valid name for a DataFrame column.
6: No, StringType is a correct type.
7: Although a StringType may not be the most efficient way to store a phone number, there is nothing fundamentally wrong with using this type here.
8: Correct - TreeType is not a type that Spark supports.
9: No, Spark DataFrames support ArrayType variables. In this case, the variable would represent a sequence of elements with type LongType, which is also a valid type for Spark DataFrames.
10: There is nothing wrong with this row.
More info: Data Types - Spark 3.1.1 Documentation (https://bit.ly/3aAPKJT)
NEW QUESTION # 102
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:
- A. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
- B. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
- C. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
- D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
- E. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
Answer: C
Explanation:
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime method to transform values in column transactionDate into strings, following the format requested in the question.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
No. Although almost correct, this uses the wrong format for the timestamp to date conversion: day/month/year instead of month/day/year.
transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
Incorrect. This answer uses wrong syntax. The command DataFrame.withColumnRenamed() is for renaming an existing column and only has two string parameters, specifying the old and the new name of the column.
transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data - but this is irrelevant for this question, since we do not deal with grouped data here.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like this: 2020-04-26 15:35:32, the default format specified in Spark for from_unixtime and not what is asked for in the question.
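The three formats discussed above can be compared with a plain-Python analogy. The sketch below does not call Spark. It uses strftime directives in place of Spark's date patterns (%m/%d/%Y standing in for MM/dd/yyyy) and fixes the time zone to UTC, whereas Spark would apply the session time zone. The epoch value is a hypothetical sample and is not taken from transactionsDf.

```python
from datetime import datetime, timezone

# Hypothetical sample value: seconds since 1970-01-01 00:00:00 UTC.
epoch_seconds = 1587915332

moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

month_day_year = moment.strftime("%m/%d/%Y")           # analogous to "MM/dd/yyyy"
day_month_year = moment.strftime("%d/%m/%Y")           # analogous to "dd/MM/yyyy"
default_style = moment.strftime("%Y-%m-%d %H:%M:%S")   # analogous to the default format

print(month_day_year)
print(day_month_year)
print(default_style)
```

Only the first string follows the month/day/year order that the question asks for.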
More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 103
The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.
Code block:
transactionsDf.where("col(predError) >= 5")
- A. Instead of >=, the SQL operator GEQ should be used.
- B. The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").
- C. The argument to the where method should be "predError >= 5".
- D. Instead of where(), filter() should be used.
- E. The argument to the where method cannot be a string.
Answer: C
Explanation:
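The argument to the where method should be "predError >= 5".
Correct. Inside a SQL expression string, a column is referenced by its name alone; col() is a Python function and cannot be used within the string.
The argument to the where method cannot be a string.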
It can be a string, no problem here.
Instead of where(), filter() should be used.
No, that does not matter. In PySpark, where() and filter() are equivalent.
Instead of >=, the SQL operator GEQ should be used.
Incorrect.
The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").
No, Spark returns a new DataFrame.
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/27.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION # 104
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?
- A. transactionsDf.resample(0.15, False, 3142)
- B. transactionsDf.sample(0.15, False, 3142)
- C. transactionsDf.sample(0.15)
- D. transactionsDf.sample(True, 0.15, 8261)
- E. transactionsDf.sample(0.85, 8429)
Answer: D
Explanation:
Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows:
DataFrame.sample(withReplacement=None, fraction=None, seed=None).
The first argument withReplacement specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark. But we have to enable it here, since the question asks for a row being able to appear more than once. So, we need to pass True for this argument.
About replacement: "Replacement" is most easily explained with the example of removing random items from a box. When you remove those "with replacement" it means that after you have taken an item out of the box, you put it back inside. So, essentially, if you randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more times. "Without replacement" means that you would not put the item back into the box after removing it. So, every time you remove an item from the box, there is one less item in the box and you can never take the same item twice.
The second argument to the sample method is fraction. This refers to the fraction of items that should be returned. In the question we are asked for 150 out of 1000 items - a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized process repeatable. This means that if you re-run the same sample() operation with the same random seed, you get the same rows returned from the sample() command. There is no behavior around the random seed specified in the question. The varying random seeds are only there to confuse you!
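The box analogy and the role of the seed can be reproduced with Python's standard random module. This is a plain-Python sketch and not Spark's implementation: Spark's sample() works with a fraction and returns only approximately that share of rows, while the functions below draw an exact number of items. The variable names and the box contents are hypothetical.

```python
import random

# The "box" of 100 items from the explanation above.
box = list(range(100))
rng = random.Random(3142)  # a fixed seed makes the draws repeatable

without_replacement = rng.sample(box, 10)   # every item can be drawn at most once
with_replacement = rng.choices(box, k=10)   # an item may be drawn several times

# Re-creating the generator with the same seed reproduces the same first draw.
rng_again = random.Random(3142)
repeat_without_replacement = rng_again.sample(box, 10)

print(sorted(without_replacement))
print(sorted(with_replacement))
print(repeat_without_replacement == without_replacement)
```

Drawing without replacement can never produce duplicates, while drawing with replacement may. Drawing more items than the box holds is only possible with replacement.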
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 105
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?
- A. transactionsDf.select("storeId").distinct() (Correct)
- B. transactionsDf.filter("storeId").distinct()
- C. transactionsDf.distinct("storeId")
- D. transactionsDf.select(col("storeId").distinct())
- E. transactionsDf["storeId"].distinct()
Answer: A
Explanation:
distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question.
More info: pyspark.sql.DataFrame.distinct - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 106
The code block shown below should set the number of partitions that Spark uses when shuffling data for joins or aggregations to 100. Choose the answer that correctly fills the blanks in the code block to accomplish this.
__1__.__2__.__3__(__4__, 100)
- A. 1. spark
2. conf
3. set
4. "spark.sql.aggregate.partitions"
- B. 1. pyspark
2. config
3. set
4. spark.shuffle.partitions
- C. 1. pyspark
2. config
3. set
4. "spark.sql.shuffle.partitions"
- D. 1. spark
2. conf
3. set
4. "spark.sql.shuffle.partitions"
- E. 1. spark
2. conf
3. get
4. "spark.sql.shuffle.partitions"
Answer: D
Explanation:
Correct code block:
spark.conf.set("spark.sql.shuffle.partitions", 100)
The conf interface is part of the SparkSession, so you need to call it through spark and not pyspark. To configure Spark, you need to use the set method, not the get method. get reads a property, but does not write it. The correct property to achieve what is outlined in the question is spark.sql.shuffle.partitions, which needs to be passed to set as a string. Properties spark.shuffle.partitions and spark.sql.aggregate.partitions do not exist in Spark.
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 107
The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')
- A. 1. withColumnRenamed
2. 'associateId'
3. 5
4. drop
5. 'productId'
- B. 1. withNewColumn
2. associateId
3. lit(5)
4. drop
5. productId
- C. 1. withColumn
2. 'associateId'
3. 5
4. remove
5. 'productId'
- D. 1. withColumn
2. col(associateId)
3. lit(5)
4. drop
5. col(productId)
- E. 1. withColumn
2. 'associateId'
3. lit(5)
4. drop
5. 'productId'
Answer: E
Explanation:
Correct code block:
transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')
For solving this question it is important that you know the lit() function (link to documentation below). This function enables you to add a column of a constant value to a DataFrame.
More info: pyspark.sql.functions.lit - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
NEW QUESTION # 108
The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.
Find the error.
Code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], "color")
Instead of calling spark.createDataFrame, just DataFrame should be called.
- A. Instead of color, a data type should be specified.
- B. The "color" expression needs to be wrapped in brackets, so it reads ["color"].
- C. The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.
- D. The commas in the tuples with the colors should be eliminated.
Answer: B
Explanation:
Correct code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])
The createDataFrame syntax is not exactly straightforward, but luckily the documentation (linked below) provides several examples on how to use it. It also shows an example very similar to the code block presented here which should help you answer this question correctly.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
NEW QUESTION # 109
Which of the following is the deepest level in Spark's execution hierarchy?
- A. Slot
- B. Task
- C. Stage
- D. Executor
- E. Job
Answer: B
Explanation:
The hierarchy is, from top to bottom: Job, Stage, Task.
Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched on worker nodes by the cluster manager on behalf of the driver for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
NEW QUESTION # 110
The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.
Schema:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
Code block:
schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)
- A. Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
- B. The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
- C. Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
- D. The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
- E. Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
Answer: A
Explanation:
Correct code block:
schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])
spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)
This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not "one or multiple" as in the question.
Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType and StructType as shown in the question is wrong.
The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46") and not two consecutive non-keyword arguments as in the original code block (see documentation linked below).
Spark cannot identify the file format correctly, because the format has to be specified either through DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling a format-specific method such as DataFrameReader.parquet().
Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that columns should be nullable, and the code block specifies this correctly by passing True as the third argument.
It is correct, however, that the modification date threshold is specified incorrectly (see above).
The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.
Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Incorrect. Columns in the schema definition do use the wrong object type (see correct answer above), but the DataFrameReader is called correctly through the SparkSession spark.
The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
False. The data type of the schema is StructType and an accepted data type for the DataFrameReader.schema() method. It is correct however that the modification date threshold is specified incorrectly (see correct answer above).
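To see why the modification date threshold has to be passed as a keyword argument, here is a minimal plain-Python sketch. It assumes a hypothetical stand-in function, options_stand_in, that only mimics the keyword-only shape of DataFrameReader.options(**options). It does not call Spark.

```python
def options_stand_in(**options):
    # Hypothetical stand-in: accepts keyword arguments only, like a **options signature
    return dict(options)


# Keyword form, as in the correct code block
accepted = options_stand_in(modifiedBefore="2029-03-20T05:44:46")

# Two consecutive non-keyword arguments, as described for the faulty call
try:
    options_stand_in("modifiedBefore", "2029-03-20T05:44:46")
    positional_failed = False
except TypeError:
    # A function that declares only **options rejects positional arguments
    positional_failed = True

print(accepted, positional_failed)
```

The keyword call produces a one-entry dictionary, while the positional call raises a TypeError before any option is recorded.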
NEW QUESTION # 111
Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)
- A. spark.read.option("mergeSchema", "true").parquet(filePath)
- B.
  nx = 0
  for file in dbutils.fs.ls(filePath):
      if not file.name.endswith(".parquet"):
          continue
      df_temp = spark.read.parquet(file.path)
      if nx == 0:
          df = df_temp
      else:
          df = df.union(df_temp)
      nx = nx+1
  df
- C. spark.read.parquet(filePath, mergeSchema='y')
- D. spark.read.parquet(filePath)
- E.
  nx = 0
  for file in dbutils.fs.ls(filePath):
      if not file.name.endswith(".parquet"):
          continue
      df_temp = spark.read.parquet(file.path)
      if nx == 0:
          df = df_temp
      else:
          df = df.join(df_temp, how="outer")
      nx = nx+1
  df
Answer: A
Explanation:
This is a very tricky question that requires knowledge of both merging and schemas when reading parquet files.
spark.read.option("mergeSchema", "true").parquet(filePath)
Correct. The mergeSchema option of Spark's DataFrameReader works well here, since columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name that appear in both partitions had different data types.
spark.read.parquet(filePath)
Incorrect. While this would read in data from both partitions, only the schema in the parquet file that is read in first would be considered, so some columns that appear only in the second partition (e.g. tax_id) would be lost.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx+1
df
Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have the exact same number of columns with identical data types.
spark.read.parquet(filePath, mergeSchema="y")
False. While using the mergeSchema option is the correct way to solve this problem and it can even be passed to DataFrameReader.parquet() as in the code block, the option accepts the value True as a boolean or as a string. 'y' is not a valid value.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how="outer")
    nx = nx+1
df
No. This performs a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated, whereas the question says that all columns included in the partitions should appear exactly once.
More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium
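The idea behind schema merging can be sketched in plain Python. This is a conceptual model only, not Spark's implementation: schemas are represented as dictionaries mapping column names to type names, merge_schemas is a hypothetical helper, and the sketch treats every type mismatch as an error, which is a simplification.

```python
def merge_schemas(first, second):
    """Return the union of two schemas, keeping each column exactly once."""
    merged = dict(first)  # copy so the inputs stay untouched
    for name, dtype in second.items():
        if name in merged and merged[name] != dtype:
            # Same column name with a different type cannot be reconciled here
            raise ValueError(
                f"conflicting types for column {name!r}: {merged[name]} vs {dtype}"
            )
        merged.setdefault(name, dtype)
    return merged


# Column names and types taken from the two partition schemas in the question
first_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "productId": "integer", "f": "integer",
}
second_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "rollId": "integer", "f": "integer",
    "tax_id": "integer",
}

merged = merge_schemas(first_partition, second_partition)
print(list(merged))
```

The merged result holds eight columns: the six from the first partition plus rollId and tax_id, each appearing once. A column such as value declared as integer in one partition and string in the other would raise an error in this sketch, mirroring the failure case described above.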
NEW QUESTION # 112
Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?
- A.
  spark.read.schema([
      StructField("transactionId", IntegerType(), True),
      StructField("predError", IntegerType(), True)
  ]).load(filePath, format="parquet")
- B.
  spark.read.schema(
      StructType([
          StructField("transactionId", StringType(), True),
          StructField("predError", IntegerType(), True)]
  )).parquet(filePath)
- C.
  spark.read.schema(
      StructType([
          StructField("transactionId", IntegerType(), True),
          StructField("predError", IntegerType(), True)]
  )).format("parquet").load(filePath)
- D.
  spark.read.schema(
      StructType(
          StructField("transactionId", IntegerType(), True),
          StructField("predError", IntegerType(), True)
  )).load(filePath)
- E.
  spark.read.schema([
      StructField("transactionId", NumberType(), True),
      StructField("predError", IntegerType(), True)
  ]).load(filePath)
Answer: C
Explanation:
The schema passed into DataFrameReader.schema() should be of type StructType or a string, so all answer options in which a list is passed are incorrect.
In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here.
NumberType() is not a valid data type and StringType() would fail, since the parquet file is stored in the "most appropriate format for this kind of data", meaning that it is most likely an IntegerType, and Spark does not convert data types if a schema is provided.
Also note that StructType accepts only a single argument (a list of StructFields). So, passing multiple arguments is invalid.
Finally, Spark needs to know which format the file is in. However, all of the options listed are valid in this respect, since Spark assumes parquet as a default when no file format is specifically passed.
More info: pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation and StructType - PySpark 3.1.2 documentation
NEW QUESTION # 113
Which of the following code blocks displays various aggregated statistics of all columns in DataFrame transactionsDf, including the standard deviation and minimum of values in each column?
- A. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min").show()
- B. transactionsDf.summary("count", "mean", "stddev", "25%", "50%", "75%", "max").show()
- C. transactionsDf.summary()
- D. transactionsDf.agg("count", "mean", "stddev", "25%", "50%", "75%", "min")
- E. transactionsDf.summary().show()
Answer: E
Explanation:
The DataFrame.summary() command is very practical for quickly calculating statistics of a DataFrame. You need to call .show() to display the results of the calculation. By default, the command calculates various statistics (see documentation linked below), including standard deviation and minimum.
Note that the answer that lists many options in the summary() parentheses does not include the minimum, which is asked for in the question.
Answer options that include agg() do not work here as shown, since DataFrame.agg() expects more complex, column-specific instructions on how to aggregate values.
More info:
- pyspark.sql.DataFrame.summary - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
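To make the statistics named in the question concrete, here is a plain-Python sketch that computes a few of them for a single numeric column. It is not how Spark computes them: summarize is a hypothetical helper, the data is made up, the percentile rows are left out, and the sketch assumes that the stddev row is the sample standard deviation.

```python
import statistics


def summarize(values):
    """Compute a subset of the statistics that summary() reports for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (assumption)
        "min": min(values),
        "max": max(values),
    }


column_values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example data
stats = summarize(column_values)
for name, value in stats.items():
    print(name, value)
```

Each dictionary key corresponds to one row of statistics for that column, and the printing loop plays the role that .show() plays in the correct answer.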
NEW QUESTION # 114
The code block shown below should return a copy of DataFrame transactionsDf with an added column cos.
This column should have the values in column value converted to degrees and having the cosine of those converted values taken, rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))
- A. 1. withColumn
  2. "cos"
  3. cos
  4. degrees
  5. transactionsDf.value
- B. 1. withColumnRenamed
  2. "cos"
  3. cos
  4. degrees
  5. "transactionsDf.value"
- C. 1. withColumn
  2. col("cos")
  3. cos
  4. degrees
  5. col("value")
- D. 1. withColumn
  2. col("cos")
  3. cos
  4. degrees
  5. transactionsDf.value
- E. 1. withColumn
  2. "cos"
  3. degrees
  4. cos
  5. col("value")
Answer: A
Explanation:
Correct code block:
transactionsDf.withColumn("cos", round(cos(degrees(transactionsDf.value)),2))
This question is especially confusing because col and "cos" look so similar. Similar-looking answer options can also appear in the exam and, just like in this question, you need to pay attention to the details to identify the correct answer option.
The first answer option to throw out is the one that starts with withColumnRenamed: The question speaks specifically of adding a column. The withColumnRenamed operator only renames an existing column, however, so you cannot use it here.
Next, you will have to decide what should be in gap 2, the first argument of transactionsDf.withColumn().
Looking at the documentation (linked below), you can find out that the first argument of withColumn actually needs to be a string with the name of the column to be added. So, any answer that includes col("cos") as the option for gap 2 can be disregarded.
This leaves you with two possible answers. The real difference between these two answers is where the cos and degrees functions are, either in gaps 3 and 4, or vice-versa. From the question you can find out that the new column should have "the values in column value converted to degrees and having the cosine of those converted values taken". This prescribes a clear order of operations: First, you convert values from column value to degrees and then you take the cosine of those values. So, the inner parenthesis (gap 4) should contain the degrees function and then, logically, gap 3 holds the cos function. This leaves you with just one possible correct answer.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation
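The order of operations can be checked on a single number with Python's math module, which has functions of the same names. This is only a sketch on plain floats, not on a DataFrame column, and the helper names cos_of_degrees and degrees_of_cos are made up for illustration.

```python
import math


def cos_of_degrees(value):
    # Gap 4 (innermost call): convert the value to degrees first
    converted = math.degrees(value)
    # Gap 3: take the cosine of the converted value, then round to two decimals
    return round(math.cos(converted), 2)


def degrees_of_cos(value):
    # Reversed order, as in the answer option that swaps gaps 3 and 4
    return round(math.degrees(math.cos(value)), 2)


print(cos_of_degrees(0.0), degrees_of_cos(0.0))
```

For the input 0.0 the first function returns 1.0, while the reversed order returns 57.3, so swapping the two functions changes the result.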
NEW QUESTION # 115
......
Truly Beneficial For Your Databricks Exam: https://certblaster.lead2passed.com/Databricks/Associate-Developer-Apache-Spark-practice-exam-dumps.html