Decision Tree Algorithm

Decision Tree:

The decision tree algorithm is a supervised machine learning algorithm used for both regression and classification problems. It recursively divides the data into subsets, at each stage splitting on the most informative attribute. These divisions create a tree-like structure in which each internal node denotes a test on an attribute, each branch denotes an outcome of that test, and each leaf node represents the final predicted label or value.

At each step, the algorithm selects the attribute that provides the best split, that is, the cleanest separation of the data. This is usually done using metrics such as Gini impurity or information gain (for classification) and mean squared error (for regression).
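For concreteness, here is a minimal sketch of the Gini impurity criterion mentioned above; the function name and the example labels are illustrative and not taken from this report.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    A value of 0 means the node is pure (contains one class only)."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A pure node has impurity 0; a 50/50 split has impurity 0.5.
print(gini_impurity([0, 0, 0, 0]))   # 0.0
print(gini_impurity([0, 0, 1, 1]))   # 0.5
```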

A node is created in the tree for the selected attribute, and each branch from this node represents a possible value of that attribute. The data is partitioned into subsets based on those values, with each subset corresponding to a branch from the current node. The process is then repeated for each subset, treating it as a new dataset, and this recursion continues until a stopping condition is met. Stopping conditions may include reaching a specified tree depth, having a minimum number of samples in a leaf node, or achieving a certain level of purity (homogeneity) in the leaf nodes. Once a stopping condition is met, each leaf node is assigned the majority class label (for classification) or the mean value (for regression) of the samples in that leaf.
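As an illustration of these stopping conditions, the sketch below fits scikit-learn's DecisionTreeClassifier with a maximum depth and a minimum leaf size; the iris dataset is used only as a stand-in, since the report does not specify the data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; the report does not name a dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stopping conditions from the text: a maximum tree depth and a
# minimum number of samples allowed in each leaf node.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```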

 

The decision tree algorithm has several attractive characteristics: it performs automatic feature selection, it is easy to interpret and understand, and it typically requires little data preprocessing.

 

K-means, K-medoids, and DBSCAN

K-means Clustering:

The k-means clustering technique groups data into k clusters by assigning each data point to the cluster whose mean is closest. When applied to a dataset shaped like a lemniscate, or infinity symbol, with k set to 2, k-means divides the data into two distinct clusters. Increasing k to 4 splits the data into smaller pieces while still producing a reasonable result.

K-Medoids:

K-medoids clustering is similar to k-means but uses the medoid (the cluster's most central actual data point) instead of the mean as the cluster center. For k = 2, k-medoids likewise divides the lemniscate data into two clusters, and for k = 4 it forms clusters around the most central data points rather than around means.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

In contrast, DBSCAN forms clusters around regions with a high density of data points and is less susceptible to outliers than k-means and k-medoids. On the lemniscate dataset, DBSCAN detected four clusters, identifying dense regions separated by less dense areas.
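The sketch below shows roughly how these methods could be applied with scikit-learn to a synthetic figure-eight dataset; the data, the eps and min_samples values, and the resulting cluster counts are assumptions for illustration, not the report's actual experiment.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Synthetic figure-eight (lemniscate-like) data as a stand-in for the
# report's dataset, which is not available here.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 500)
X = np.column_stack([np.cos(t), np.sin(t) * np.cos(t)])
X += rng.normal(scale=0.03, size=X.shape)

# k-means for k = 2 and k = 4 (cluster centers are means).
labels_k2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_k4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups dense regions; eps and min_samples are guesses, and the
# number of clusters found depends on them. Points labelled -1 are noise.
labels_db = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
print("DBSCAN clusters found:",
      len(set(labels_db)) - (1 if -1 in labels_db else 0))

# k-medoids is not part of scikit-learn itself; the KMedoids estimator in
# the separate scikit-learn-extra package follows the same fit_predict pattern.
```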

Monte Carlo Simulation

Monte Carlo Simulation:

Monte Carlo simulation is a computational technique that uses random sampling to model and analyze complex systems or processes. An overview of the procedure follows.

Three main steps are involved in carrying out a Monte Carlo simulation.

Define the parameters and variables:

Decide which important parameters and variables affect the system or process you wish to model. These could consist of output variables, constants, and input variables. Give a precise description of each variable’s attributes and range.
Assign Probability Distributions:

Assign to each variable the probability distribution that best captures the uncertainty surrounding it. Common choices include uniform, triangular, normal (Gaussian), and custom distributions based on past data or expert knowledge.

Create Random Samples:

Using the assigned probability distributions as a guide, create a large number of random samples for each variable using random number generators. The required degree of precision and the complexity of the system determine how many samples are needed.
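As a small illustration of these three steps, the sketch below simulates a hypothetical total cost from three uncertain inputs with triangular, normal, and uniform distributions; the variable names and parameters are invented for the example.

```python
import numpy as np

# Hypothetical example: total project cost with three uncertain inputs.
# The variable names and distribution parameters are illustrative only.
rng = np.random.default_rng(42)
n_samples = 100_000                       # step 3: many random samples

# Steps 1-2: variables and their assigned probability distributions.
material = rng.triangular(left=8, mode=10, right=15, size=n_samples)
labor    = rng.normal(loc=20, scale=3, size=n_samples)
overhead = rng.uniform(low=2, high=5, size=n_samples)

total_cost = material + labor + overhead  # output variable

print("mean cost:       %.2f" % total_cost.mean())
print("95th percentile: %.2f" % np.percentile(total_cost, 95))
```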

Permutation Test

Permutation Test: The permutation test, also called a randomization or re-randomization test, is a non-parametric statistical method for determining the significance of observed differences. It is useful when the assumptions of traditional parametric tests, such as t-tests or ANOVA, cannot be met, or when working with small sample sizes.

Formulation of Hypotheses:

Null hypothesis (H0): no effect exists; any observed difference is due to random chance.
Alternative hypothesis (H1): there is a significant effect.
Observed Test Statistic Calculation:

Based on the observed data, compute the test statistic. This could be a difference in means, a correlation coefficient, or another relevant measure.

 

Permutation Method:

Combine all data points from the compared groups.
To generate a new set of data points, shuffle or permute the data at random.
Calculation of the Test Statistic for Each Permutation:

Reevaluate the test statistic for each permuted dataset.
Construction of the Permutation Distribution:

Create a distribution of test statistics based on the permuted datasets.
Comparison of Observed Statistic with Permutation Distribution:

Determine where the observed test statistic falls in the distribution. If it is notably extreme in comparison to the permuted distribution, the null hypothesis may be rejected.

P-value computation:

The p-value represents the probability of seeing a test statistic as extreme as the one computed from the observed data, assuming the null hypothesis is correct. A small p-value indicates evidence against the null hypothesis, especially if the observed test statistic is in the extreme tail of the distribution.
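A minimal sketch of this procedure for a two-sample difference in means is shown below; the data values and the number of permutations are illustrative only.

```python
import numpy as np

# Illustrative two-sample permutation test for a difference in means.
# The data arrays are made up; only the procedure matches the text.
rng = np.random.default_rng(0)
group_a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])
group_b = np.array([4.2, 4.8, 4.5, 5.0, 4.4, 4.7])

observed = group_a.mean() - group_b.mean()   # observed test statistic
pooled = np.concatenate([group_a, group_b])  # combine all data points

n_permutations = 10_000
perm_stats = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)       # shuffle the pooled data
    perm_stats[i] = (shuffled[:len(group_a)].mean()
                     - shuffled[len(group_a):].mean())

# Two-sided p-value: fraction of permuted statistics at least as
# extreme as the observed one.
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print("observed difference:", observed, " p-value:", p_value)
```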

Cross Validation

Cross-Validation: Cross-validation is a technique used to evaluate the performance of a model, such as a regression or classification model. Here we use K-fold cross-validation, which is widely applied to machine learning and statistical models.

K-fold Cross-Validation: The data is split into K roughly equal-sized parts, or “folds.” Typically K is 5 or 10, but it can vary based on the requirements.

In each iteration, the model is trained on the folds that form the training set and then used to make predictions on the held-out testing fold. The performance metric MSE (mean squared error) measures how well the model performed on the testing fold for that iteration, and the scores are averaged across all K iterations.

Mean Squared Error (MSE): Mean squared error is a commonly used performance metric in statistics and machine learning that measures the average squared difference between the values predicted by a model and the actual values in a dataset, MSE = (1/n) * Σ(yᵢ − ŷᵢ)². The lower the MSE, the better the model's predictions match the observed data.
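The sketch below shows one way to run 5-fold cross-validation with MSE using scikit-learn; the synthetic dataset and the linear model are placeholders, since the report does not name a specific dataset or model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data as a placeholder for the report's dataset.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 5-fold cross-validation; scikit-learn reports MSE as a negative score,
# so the sign is flipped to get the usual (positive) MSE per fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(LinearRegression(), X, y,
                          cv=cv, scoring="neg_mean_squared_error")
print("MSE per fold:", -neg_mse)
print("average MSE:", -neg_mse.mean())
```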

Linear Regression with Two Predictor Variables

Linear Regression with Two Predictor Variables:

In this type of regression, we use two predictor variables, X1 and X2, to predict an outcome variable, Y. We are interested not only in their individual effects but also in how they interact (X1*X2); quadratic terms could be added as well, although the model below includes only the interaction. This gives a more flexible model that captures how the variables work together to produce the outcome Y.

Y = β0 + β1*X1 + β2*X2 + β3*X1*X2 + ε

Where Y is the value we want to predict, X1 and X2 are the two predictor variables, β0 is the intercept term, β1 and β2 are the coefficients for the linear effects of X1 and X2, β3 is the coefficient for the interaction term (X1*X2), and ε is the error term.
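As an illustrative sketch (not the report's actual analysis), a model of this form can be fitted with statsmodels, whose formula "Y ~ X1 * X2" expands to the two main effects plus their interaction; the simulated data and coefficient values below are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a true interaction effect; all numbers are
# illustrative, not taken from the report.
rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + 0.6 * X1 * X2 + rng.normal(scale=1.0, size=n)
df = pd.DataFrame({"Y": Y, "X1": X1, "X2": X2})

# "X1 * X2" in the formula expands to X1 + X2 + X1:X2, i.e. both main
# effects plus the interaction term from the equation above.
model = smf.ols("Y ~ X1 * X2", data=df).fit()
print(model.params)   # estimates of beta0, beta1, beta2, beta3
```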

Overfitting: Overfitting happens when a model performs very well on the data it was trained on but does not perform well on new, unseen data.

Cross-Validation:

Cross-validation is an effective way to detect and guard against overfitting. The dataset is divided into smaller chunks, and the model is trained and tested multiple times, using a different chunk for testing each time. This shows how well the model performs on parts of the data it has not seen during training.
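A small example of using cross-validation to spot overfitting: the deliberately over-flexible polynomial model and the synthetic data below are assumptions, but the gap between training MSE and cross-validated MSE illustrates the point made above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic dataset; an overly flexible model (high-degree
# polynomial) is used on purpose to show the overfitting gap.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(30, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=30)

model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X, y)

train_mse = mean_squared_error(y, model.predict(X))
cv_mse = -cross_val_score(model, X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
# A training MSE far below the cross-validated MSE is the signature
# of overfitting described above.
print("training MSE: %.3f   cross-validated MSE: %.3f" % (train_mse, cv_mse))
```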