An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
Correct Answer:
E
One-hot encoding transforms categorical variables into a format that can be provided to machine learning algorithms to better predict the output. However, when done prematurely or universally within a feature repository, it can be problematic:
✑ Dimensionality Increase: One-hot encoding significantly increases the feature space, especially with high-cardinality features, which can lead to high memory consumption and slower computation.
✑ Model Specificity: Some models handle categorical variables natively (like decision trees and boosting algorithms), and premature one-hot encoding can lead to inefficiency and loss of information (e.g., ordinal relationships).
✑ Sparse Matrix Issue: It often results in a sparse matrix where most values are zero, which can be inefficient in both storage and computation for some algorithms.
✑ Generalization vs. Specificity: Encoding should ideally be tailored to specific models and use cases rather than applied generally in a feature repository.
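The dimensionality problem is easy to see with a quick sketch. The example below uses pandas purely as an illustration (the `user_id` column and its cardinality are hypothetical): a single categorical column with 1,000 distinct values explodes into 1,000 one-hot columns.

```python
import pandas as pd

# Hypothetical high-cardinality categorical column: 1,000 distinct IDs
# become 1,000 one-hot columns after encoding.
df = pd.DataFrame({"user_id": [f"user_{i}" for i in range(1000)]})
encoded = pd.get_dummies(df, columns=["user_id"])
print(df.shape[1], "->", encoded.shape[1])  # 1 -> 1000
```

Stored in a feature repository, those 1,000 mostly-zero columns would be materialized for every consumer, whether or not their model benefits from them.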
References
✑ "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Max Kuhn and Kjell Johnson (CRC Press, 2019).
A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column's median value.
They have developed the following code block to accomplish this task:
The code block is not accomplishing the task.
Which of the following reasons describes why the code block is not accomplishing the imputation task?
Correct Answer:
D
In the provided code block, the Imputer object is created but not fitted on the data to produce an ImputerModel. The transform method is being called directly on the Imputer object, which does not yet contain the fitted median values needed for imputation. The correct approach is to fit the imputer on the dataset first.
Corrected code:
imputer = Imputer(
    strategy="median",
    inputCols=input_columns,
    outputCols=output_columns
)
imputer_model = imputer.fit(features_df)  # Fit the imputer to the data
imputed_features_df = imputer_model.transform(features_df)  # Transform the data using the fitted imputer
References:
✑ PySpark ML Documentation
Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?
Correct Answer:
B
Spark ML (part of Apache Spark's MLlib) is designed to handle machine learning tasks across multiple nodes in a cluster, effectively parallelizing tasks like hyperparameter tuning. It supports various machine learning algorithms that can be optimized over a Spark cluster, making it suitable for parallelizing hyperparameter tuning for single-node machine learning models when they are adapted to run on Spark.
References
✑ Apache Spark MLlib Guide:https://spark.apache.org/docs/latest/ml-guide.html
Spark ML is a library within Apache Spark designed for scalable machine learning. It provides tools to handle large-scale machine learning tasks, including parallelizing the hyperparameter tuning process for single-node machine learning models using a Spark cluster. Here's a detailed explanation of how Spark ML can be used:
✑ Hyperparameter Tuning with CrossValidator: Spark ML includes the CrossValidator and TrainValidationSplit classes, which are used for hyperparameter tuning. These classes can evaluate multiple sets of hyperparameters in parallel using a Spark cluster.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model
model = ...

# Create a parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(model.hyperparam1, [value1, value2]) \
    .addGrid(model.hyperparam2, [value3, value4]) \
    .build()

# Define the evaluator
evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
✑ Parallel Execution: Spark distributes the tasks of training models with different hyperparameters across the cluster's nodes. Each node processes a subset of the parameter grid, which allows multiple models to be trained simultaneously.
✑ Scalability: Spark ML leverages the distributed computing capabilities of Spark.
This allows for efficient processing of large datasets and training of models across many nodes, which speeds up the hyperparameter tuning process significantly compared to single-node computations.
References
✑ Apache Spark MLlib Documentation
✑ Hyperparameter Tuning in Spark ML
