Free Practice Questions for Databricks Certified Machine Learning Associate Exam (Databricks-Machine-Learning-Associate)

QUESTION 11

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
[Exhibit: definition of the predict function; image not reproduced]
They have written the following incomplete code block to use predict to score each record of the Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

A. predict(*spark_df.columns)
B. mapInPandas(predict)
C. predict(Iterator(spark_df))
D. mapInPandas(predict(spark_df.columns))
E. predict(spark_df.columns)

Correct Answer: B
To apply predict to each record of a Spark DataFrame, use the mapInPandas method. mapInPandas hands each partition of the DataFrame to the function as an iterator of pandas DataFrames, and the function (predict in this case) yields the scored batches. The completion is therefore mapInPandas(predict): the function object itself is passed rather than called. Note that in PySpark the full call also takes the output schema as its second argument.
References:
✑ PySpark DataFrame documentation (Using mapInPandas with UDFs).
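
The pattern described above can be sketched as follows. This is a minimal illustration, not the code from the missing exhibit: here predict is written as a plain Python function over an iterator of pandas DataFrames (the form mapInPandas accepts), and the weights, the column names x1 and x2, the output column prediction, and the helper score_with_spark are all assumptions made for the example.

```python
from typing import Iterator

import pandas as pd


def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Score batches of rows; the signature matches what mapInPandas expects."""
    # Assumption: a real job would load the single-node model once here.
    # These weights are a hypothetical stand-in for that model.
    weights = {"x1": 2.0, "x2": -1.0}
    for pdf in batches:
        score = sum(pdf[col] * w for col, w in weights.items())
        yield pd.DataFrame({"prediction": score})


def score_with_spark(spark_df):
    """Apply predict to every partition of a Spark DataFrame (needs a Spark session)."""
    # The function object is passed, not called; the schema describes the output.
    return spark_df.mapInPandas(predict, schema="prediction double")


# Local check without Spark: feed the function an iterator of pandas batches.
local_batches = iter([pd.DataFrame({"x1": [1.0, 2.0], "x2": [0.5, 1.0]})])
local_result = pd.concat(predict(local_batches), ignore_index=True)
print(local_result)
```

Because the function only sees pandas DataFrames, it can be exercised locally like this before it is handed to Spark through score_with_spark.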

QUESTION 12

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata
B. pandas API on Spark DataFrames are more performant than Spark DataFrames
C. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
D. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
E. pandas API on Spark DataFrames are unrelated to Spark DataFrames

Correct Answer: C
Pandas API on Spark (previously known as Koalas) provides a pandas-like API on top of Apache Spark. It allows users to perform pandas operations on large datasets using Spark's distributed compute capabilities. Internally, it uses Spark DataFrames and adds metadata that facilitates handling operations in a pandas-like manner, ensuring compatibility and leveraging Spark's performance and scalability.
References:
✑ pandas API on Spark documentation: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

QUESTION 13

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

A. A holdout set is not necessary when using a train-validation split
B. Reproducibility is achievable when using a train-validation split
C. Fewer hyperparameter values need to be tested when using a train-validation split
D. Bias is avoidable when using a train-validation split
E. Fewer models need to be trained when using a train-validation split

Correct Answer: E
A train-validation split is often preferred over k-fold cross-validation (with k > 2) when computational efficiency is a concern. With a train-validation split, each hyperparameter configuration is trained once on the training set and evaluated once on the validation set, whereas k-fold cross-validation trains k models per configuration (one for each fold).
This reduction in the number of models trained can save significant computational resources and time, especially when dealing with large datasets or complex models.
References:
✑ Model Evaluation with Train-Test Split
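
The difference in training cost can be made concrete with a short count. This is a rough sketch under stated assumptions: the hyperparameter grid and the helper count_model_fits are hypothetical, and the count ignores any final refit on the full training data.

```python
from itertools import product


def count_model_fits(param_grid: dict, n_folds: int = 1) -> int:
    """Number of models trained during a grid search.

    n_folds=1 stands for a single train-validation split;
    n_folds=k stands for k-fold cross-validation.
    """
    n_configs = len(list(product(*param_grid.values())))
    return n_configs * n_folds


# Hypothetical grid: 3 x 2 = 6 hyperparameter configurations.
grid = {"max_depth": [2, 4, 6], "n_estimators": [50, 100]}

split_fits = count_model_fits(grid)           # one fit per configuration
cv_fits = count_model_fits(grid, n_folds=5)   # five fits per configuration
print(split_fits, cv_fits)
```

For this grid the train-validation split trains 6 models and 5-fold cross-validation trains 30, so the gap grows linearly with k.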

QUESTION 14

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.
Which of the following suggestions should the team include in their guidelines?

A. The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.
B. The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.
C. The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.
D. The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.

Correct Answer: C
The F1 score is the harmonic mean of precision and recall and is particularly useful in situations where there is a significant imbalance between positive and negative classes. When there is a class imbalance, accuracy can be misleading because a model can achieve high accuracy by simply predicting the majority class. The F1 score, however, provides a better measure of the test's accuracy in terms of both false positives and false negatives.
Specifically, the F1 score should be used over accuracy when:
✑ There is a significant imbalance between positive and negative classes.
✑ Avoiding false negatives is a priority, meaning recall (the ability to detect all positive instances) is crucial.
In this scenario, the F1 score balances both precision (the ability to avoid false positives) and recall, providing a more meaningful measure of a model's performance under these conditions.
References:
✑ Databricks documentation on classification metrics: Classification Metrics
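
A small worked example shows how accuracy can hide the problem on imbalanced data. The labels and the two sets of predictions below are made up for illustration, and the helper functions are a plain-Python sketch rather than any particular library's API.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when both are undefined or zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A always predicts the majority (negative) class.
always_negative = [0] * 100

# Model B finds 4 of the 5 positives and raises 4 false alarms.
finds_positives = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

for name, y_pred in [("A", always_negative), ("B", finds_positives)]:
    print(name, accuracy(y_true, y_pred), round(f1_score(y_true, y_pred), 3))
```

Both models score the same accuracy on this data, but only the F1 score separates the model that never detects a positive case from the one that detects most of them.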

QUESTION 15

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.
batch_df has the following schema:
customer_id STRING
The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:
[Exhibit: code block image not reproduced]
In which situation will the machine learning engineer's code block perform the desired inference?

A. When the Feature Store feature set was logged with the model at model_uri
B. When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark
C. When the model at model_uri only uses customer_id as a feature
D. This code block will not perform the desired inference in any situation.
E. When all of the features used by the model at model_uri are in a single Feature Store table

Correct Answer: A
The code block will perform the desired inference when the Feature Store feature set was logged with the model at model_uri. batch_df contains only the key column customer_id, so the model's features have to be retrieved at scoring time. When a model is logged through the Feature Store, the feature lookup metadata is packaged with it, which lets batch scoring join the required features onto batch_df by customer_id before the model makes predictions.
References:
✑ Databricks documentation on Feature Store: Feature Store in Databricks
