Description:
Chapter 1: Setting up the PySpark Environment
Chapter Goal: Introduce readers to the PySpark environment, walk them through the steps to set up the environment, and execute some basic operations
Number of pages: 20
Subtopics:
1. Setting up your environment & data
2. Basic operations
Chapter 2: Basic Statistics and Visualizations
Chapter Goal: Introduce readers to the predictive model-building framework and help them become familiar with basic data operations
Number of pages: 30
Subtopics:
1. Basic Statistics
2. Data manipulations/feature engineering
3. Data visualizations
4. Model building framework
Chapter 3: Variable Selection
Chapter Goal: Illustrate the different variable selection techniques used to identify the top variables in a dataset and show how they can be implemented using PySpark pipelines
Number of pages: 40
Subtopics:
1. Principal Component Analysis
2. Weight of Evidence & Information Value
3. Chi square selector
4. Singular Value Decomposition
5. Voting based approach
Chapter 4: Introduction to different supervised machine learning algorithms, implementations & fine-tuning techniques
Chapter Goal: Explain and demonstrate supervised machine learning techniques and help readers understand the challenges and nuances of model fitting with multiple evaluation metrics
Number of pages: 40
Subtopics:
1. Supervised:
- Linear regression
- Logistic regression
- Decision Trees
- Random Forests
- Gradient Boosting
- Neural Nets
- Support Vector Machine
- One Vs Rest Classifier
- Naive Bayes
2. Model hyperparameter tuning:
- L1 & L2 regularization
- Elastic net
Chapter 5: Model Validation and Selecting the Best Model
Chapter Goal: Illustrate the different techniques used to validate models, demonstrate which technique should be used for a particular model selection task, and finally pick the best model from the candidate models
Number of pages: 30
Subtopics:
1. Model Validation Statistics:
- ROC
- Accuracy
- Precision
- Recall
- F1 Score
- Misclassification
- KS
- Decile
- Lift & Gain
- R square
- Adj