A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function, and they have defined the search space search_space.
As a result, they have the following code block:
Which of the following changes do they need to make to the above code block in order to accomplish the task?
Correct Answer: A
SparkTrials() is used to distribute trials of hyperparameter tuning across a Spark cluster. If the environment does not support Spark, or if the user prefers not to use distributed computing for this purpose, switching to Trials() is the appropriate change. Trials() is the standard class for managing search trials in Hyperopt, but it does not distribute the computation. If the user is encountering issues with SparkTrials(), possibly due to an unsupported configuration or an error in the cluster setup, switching to Trials() runs the optimization locally in a non-distributed manner.
References
✑ Hyperopt documentation: http://hyperopt.github.io/hyperopt/
A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".
Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?
Correct Answer: B
To register a model identified by a specific run_id in the MLflow Model Registry, the appropriate line of code is: mlflow.register_model(f"runs:/{run_id}/model", "best_model")
This code correctly specifies the path to the model within the run (runs:/{run_id}/model) and registers it under the name "best_model" in the Model Registry. This allows the model to be tracked, managed, and transitioned through different stages (e.g., Staging, Production) within the MLflow ecosystem.
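A minimal sketch of how the model URI is assembled (the run_id value here is a hypothetical placeholder; actually executing mlflow.register_model requires a live tracking server holding the run's logged artifacts, so that call is shown commented out):

```python
run_id = "0a1b2c3d4e5f"  # hypothetical run ID; in practice read from the best run

# The "runs:/" URI scheme points at an artifact logged under a specific run;
# "model" is the artifact path the model was logged with.
model_uri = f"runs:/{run_id}/model"

# Requires an MLflow tracking server that holds this run's artifacts:
# import mlflow
# mlflow.register_model(model_uri, "best_model")
print(model_uri)  # runs:/0a1b2c3d4e5f/model
```

Once registered, subsequent calls with the same name create new versions under "best_model" rather than new registry entries.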
References
✑ MLflow documentation on model registry: https://www.mlflow.org/docs/latest/model-registry.html#registering-a-model
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.
Which of the following lines of code will return the metadata description?
Correct Answer: C
To retrieve the metadata description of a feature table created using the Feature Store Client (referred to here as fs), the correct method is to call get_table on the fs client with the table name as an argument, then access the description attribute of the returned object. The code snippet fs.get_table("new_table").description achieves this by fetching the table object for "new_table" and then reading its description attribute, where the metadata is stored. The other options do not correctly retrieve the metadata description.
References
✑ Databricks Feature Store documentation (Accessing Feature Table Metadata).
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:
Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?
Correct Answer: E
To address the concern about standardizing features prior to splitting the data, the correct approach is to use the Pipeline API so that only the training data's summary statistics are used to standardize the test data. This is achieved by fitting the StandardScaler on the training data and then transforming both the training and test data with the fitted scaler. This prevents information from the test set leaking into the model training process and ensures the model is evaluated fairly.
References
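The same leakage-avoidance pattern, sketched here with scikit-learn rather than Spark ML since the idea is library-agnostic (the toy dataset is made up for illustration): the scaler is fit on the training split only, and the test split is transformed using those train-derived statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy single-feature dataset
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)  # summary statistics from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuses the train mean/std; no leakage
```

In Spark ML the equivalent is making StandardScaler a stage in a Pipeline, fitting the Pipeline on the training DataFrame, and calling transform on both splits with the fitted PipelineModel.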
✑ Best Practices in Preprocessing in Spark ML (Handling Data Splits and Feature Standardization).
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
Correct Answer: C
Imputing missing values with the median is preferred over the mean when the data contains extreme outliers. The median is a more robust measure of central tendency in such cases because it is not pulled toward outliers the way the mean is. Using the median keeps the imputed values representative of the typical data point, preserving the shape of the dataset's distribution. The other options are not specifically relevant to handling outliers in numerical data.
References
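A quick standard-library illustration of why the median is the safer fill value when extreme outliers are present (the values are made up for the example):

```python
from statistics import mean, median

values = [10, 12, 11, 13, 9, 500]  # one extreme outlier

print(mean(values))    # 92.5  -- dragged far above the typical value by the outlier
print(median(values))  # 11.5  -- barely affected by the outlier
```

Filling missing entries with 92.5 would plant implausible values throughout the feature, while 11.5 stays close to the bulk of the observations.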
✑ Data Imputation Techniques (Dealing with Outliers).