1 |
How is logistic regression done? Logistic regression models the relationship between the dependent variable (the label we want to predict) and one or more independent variables (the features). It estimates the probability of the label by passing a weighted sum of the features through the logistic (sigmoid) function. |
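As a minimal sketch of the idea, the snippet below turns a weighted sum of features into a probability with the sigmoid function. The weights, bias, and feature values are made-up illustrations, not fitted parameters; in practice they would be learned from training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical, hand-picked parameters (assumption: not learned from data).
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is a common, but not mandatory, threshold
```

The 0.5 cut-off is a design choice; a different threshold trades false positives against false negatives.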
2 |
Explain the steps in making a decision tree. 1. Take the entire data set as input 2. Calculate the entropy of the target variable, as well as of the predictor attributes 3. Calculate the information gain of each attribute (how much splitting on that attribute reduces the entropy of the target) 4. Choose the attribute with the highest information gain as the root node 5. Repeat the same procedure on every branch until the decision node of each branch is finalized |
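Steps 2 to 4 can be sketched as follows. The tiny data set, the attribute names (`outlook`, `windy`), and the helper names are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the target minus the weighted entropy after splitting."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: 'outlook' fully determines the label, 'windy' does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute chosen for the root node
```

Step 5 would apply the same selection recursively to the rows in each branch.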
3 |
How do you build a random forest model? A random forest is built from a number of decision trees. If you split the data into different random samples and build a decision tree on each sample, the random forest brings all those trees together and combines their predictions. |
4 |
Steps to build a random forest model: 1. Randomly select 'k' features from a total of 'm' features, where k << m 2. Among the 'k' features, calculate the node 'd' using the best split point 3. Split the node into daughter nodes using the best split 4. Repeat steps two and three until leaf nodes are finalized 5. Build the forest by repeating steps one to four 'n' times to create 'n' trees |
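The steps above can be sketched in plain Python as shown below. This is a simplified illustration under several assumptions rather than a production implementation: it uses Gini impurity as the split criterion, draws a fresh random feature subset at every node, stops when no chosen feature improves the split, and trains each tree on a bootstrap sample (a common practice that the steps above do not list). All function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, k, rng):
    feature_ids = rng.sample(range(len(X[0])), k)  # step 1: pick k of m features
    split = best_split(X, y, feature_ids)          # step 2: best split point
    if split is None:                              # no improvement: make a leaf
        return Counter(y).most_common(1)[0][0]
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    # Steps 3 and 4: split into daughter nodes and recurse.
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], k, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], k, rng))

def predict_tree(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(X, y, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # step 5: repeat n times
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample (assumption)
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], k, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical toy data with m = 3 features and k = 2.
X = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [8, 8, 8], [9, 9, 9], [10, 10, 10]]
y = ["a", "a", "a", "b", "b", "b"]
forest = build_forest(X, y, n_trees=25, k=2)
```

Calling `predict_forest(forest, [0, 0, 0])` then returns the class that most trees vote for.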
5 |
How can you avoid overfitting your model? Overfitting refers to a model that fits its training data too closely, noise included, and therefore generalizes poorly to new data. There are three main methods to avoid overfitting: 1. Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data 2. Use cross-validation techniques, such as k-fold cross-validation 3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting |
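To illustrate the second method, here is a minimal sketch of how k-fold cross-validation partitions a data set. The function name is hypothetical, and the sketch assumes the rows are already in random order (real pipelines usually shuffle first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

A model would be trained on each `train_idx` and scored on the matching `test_idx`; a large gap between training and test scores is a sign of overfitting.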
6 |
What is dimensionality reduction, and what are its benefits? Dimensionality reduction refers to the process of converting a data set with a large number of dimensions (fields) into one with fewer dimensions that conveys similar information concisely. This reduction helps compress data and reduce storage space. It also reduces computation time, as fewer dimensions mean less computing. It removes redundant features; for example, there's no point in storing the same value in two different units (meters and inches). |
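One simple way to spot a redundant feature like the meters/inches pair is to check for near-perfect correlation. The sketch below is only an illustration of that check, with made-up values and an arbitrary 0.999 threshold; it is not a full dimensionality reduction technique such as PCA.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical column stored twice, once per unit.
meters = [1.0, 1.5, 2.0, 3.2]
inches = [m * 39.3701 for m in meters]

r = pearson(meters, inches)
redundant = abs(r) > 0.999  # threshold chosen for illustration only
```

When two columns are this strongly correlated, dropping one of them loses essentially no information.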
7 |
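How can you select k for k-means? We use the elbow method to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the data set for a range of values of 'k', where 'k' is the number of clusters, and to compute the within-cluster sum of squares (WSS) for each. WSS is defined as the sum of the squared distances between each member of a cluster and its centroid. Plotting WSS against 'k', we choose the 'k' at the bend (the "elbow") of the curve, beyond which adding clusters reduces WSS only slightly. |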
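The following sketch computes WSS for several values of k on a toy data set. It assumes a deliberately naive, deterministic initialization (evenly spaced points from the input) so the result is reproducible; real implementations usually use random or k-means++ initialization. The function names and data are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def k_means(points, k, iterations=20):
    # Naive deterministic initialization (assumption, for reproducibility).
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assign(points, centroids)

def wss(points, centroids, labels):
    """Sum of squared distances between each point and its cluster centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Two well-separated groups, so the elbow is expected at k = 2.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
wss_by_k = {k: wss(points, *k_means(points, k)) for k in (1, 2, 3)}
```

On this data the drop in WSS from k = 1 to k = 2 is large, while going to k = 3 improves it only slightly, which is the elbow pattern the method looks for.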
8 |
What is a confusion matrix? A confusion matrix is a table that summarizes the prediction results of a classification model and is used to describe its performance. For a problem with n classes, it is an n × n matrix whose cells count how often each actual class was predicted as each class. |
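A minimal sketch of building such a matrix by hand is shown below. It assumes the convention that rows are actual classes and columns are predicted classes (some sources use the transpose); the labels and predictions are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """n x n count matrix: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Hypothetical predictions for a three-class problem.
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]

cm = confusion_matrix(y_true, y_pred, ["bird", "cat", "dog"])

# Correct predictions sit on the diagonal, so accuracy is its sum over the total.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)
```

Off-diagonal cells show which classes the model confuses with each other, which a single accuracy figure hides.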