Free Databricks-Machine-Learning-Associate Exam Dumps

Question 6

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?

Correct Answer: B
To filter rows in a Spark DataFrame based on a condition, you use the filter method along with a column condition. The correct PySpark syntax for this task is spark_df.filter(col("price") > 0), which keeps only those rows where the value in the "price" column is greater than 0. The col function is used to refer to a column in an expression. The other options either do not use correct Spark DataFrame syntax or are intended for different data manipulation frameworks such as pandas.
References:
✑ PySpark DataFrame API documentation (Filtering DataFrames).

Question 7

Which of the following statements describes a Spark ML estimator?

Correct Answer: D
In Spark MLlib, an estimator is an algorithm that can be "fit" on a DataFrame to produce a model (a Transformer), which can then be used to transform one DataFrame into another, typically by adding predictions or model scores. This is a fundamental concept in Spark machine learning pipelines, where the workflow consists of fitting estimators to data to produce transformers.
References:
✑ Spark MLlib Documentation: https://spark.apache.org/docs/latest/ml-pipeline.html#estimators
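The fit-then-transform contract can be sketched with a toy, pure-Python analogue (these class names are invented for illustration and are not Spark APIs):

```python
# Conceptual sketch of the estimator/transformer pattern, not Spark's classes:
# an estimator's fit() learns from data and returns a transformer, and the
# transformer's transform() then maps data to new data.
class MeanEstimator:
    def fit(self, values):
        # Learning step: compute the mean of the training data.
        mean = sum(values) / len(values)
        return MeanCenterer(mean)


class MeanCenterer:
    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Transform step: subtract the learned mean from each value.
        return [v - self.mean for v in values]


model = MeanEstimator().fit([1.0, 2.0, 3.0])  # estimator -> transformer
centered = model.transform([4.0])             # transformer -> transformed data
```

In Spark, StringIndexer and LinearRegression follow exactly this shape: calling .fit(df) returns a fitted model whose .transform(df) produces a new DataFrame.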

Question 8

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Correct Answer: C
For large datasets, Spark ML distributes the training of a linear regression model using iterative optimization. When the matrix-decomposition ("normal equation") approach does not scale, Spark ML's LinearRegression falls back to Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization (or OWL-QN when L1 regularization is used), which iteratively updates the model parameters; the older RDD-based MLlib API also offered Stochastic Gradient Descent (SGD). These iterative methods suit distributed environments because each step only needs a gradient aggregated across the partitions of the data, so the model can be updated incrementally without materializing a large matrix decomposition.
References:
✑ Databricks documentation on linear regression: Linear Regression in Spark ML
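To make the iterative idea concrete, here is a single-machine NumPy sketch of mini-batch gradient descent for linear regression. This is only an illustration of iterative parameter updates, not Spark's actual distributed implementation; the data, batch size, and learning rate are arbitrary:

```python
import numpy as np

# Synthetic regression problem with known weights (no noise, for clarity).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # model parameters, updated incrementally
lr = 0.1          # learning rate
for epoch in range(200):
    for start in range(0, len(X), 100):        # mini-batches of 100 rows
        Xb, yb = X[start:start + 100], y[start:start + 100]
        grad = Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient of squared error
        w -= lr * grad                         # incremental parameter update
```

Because each update only needs a gradient over one batch, the work parallelizes naturally: in Spark, the per-partition gradients are computed on the executors and aggregated on the driver at each iteration.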

Question 9

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.
They have written the following incomplete code block:
[Code exhibit not reproduced in this dump]
Which of the following pieces of code can be used to fill in the above blank to complete the task?

Correct Answer: B
The mapInPandas function in the PySpark DataFrame API applies a function to each partition of a DataFrame. For grouped data, however, groupby followed by applyInPandas is the correct approach: it passes each group to the function as a separate pandas DataFrame. mapInPandas would only be appropriate if the function should run over arbitrary partitions rather than over individual groups. Since the code snippet uses groupby, the intent is to apply train_model to each group specifically, which is exactly what applyInPandas does: every group produced by groupby is routed through the train_model function, preserving the grouping.
References:
✑ PySpark Documentation on applying functions to grouped data: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
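Since the original exhibit is not shown, the following is a hypothetical sketch of the pattern; the group key "device_id", the column "y", and the trivial per-group "model" are all invented for illustration:

```python
import pandas as pd

def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # Per-group "training": here just the mean of y stands in for a real model.
    # applyInPandas hands each group to this function as a pandas DataFrame
    # and expects a pandas DataFrame back.
    return pd.DataFrame({
        "device_id": [pdf["device_id"].iloc[0]],
        "prediction": [pdf["y"].mean()],
    })

# On a Spark DataFrame df, each group would be routed through train_model with:
# trained_df = df.groupBy("device_id").applyInPandas(
#     train_model, schema="device_id long, prediction double"
# )
```

The schema string passed to applyInPandas must describe the columns of the DataFrame that train_model returns, not the columns of the input.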

Question 10

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.
They have developed this code block to accomplish this task:
[Code exhibit not reproduced in this dump]
The code block is returning an error.
Which of the following adjustments does the data scientist need to make to accomplish this task?

Correct Answer: C
The OneHotEncoder in Spark ML requires numerical indices as inputs rather than string labels. Therefore, the string columns must first be converted to numerical indices using StringIndexer; OneHotEncoder can then be applied to those indices. Corrected code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Convert the string columns to numerical indices
indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in input_columns]
indexer_model = Pipeline(stages=indexers).fit(features_df)
indexed_features_df = indexer_model.transform(features_df)

# One-hot encode the indexed columns
ohe = OneHotEncoder(inputCols=[c + "_index" for c in input_columns],
                    outputCols=output_columns)
ohe_model = ohe.fit(indexed_features_df)
ohe_features_df = ohe_model.transform(indexed_features_df)
References:
✑ PySpark ML Documentation