Cross Validation

Cross-Validation: Cross-validation is a technique used to evaluate the performance of a particular model, such as a regression or classification model. Here we will use K-fold cross-validation, which is widely used for machine learning and statistical models.

K-Fold Cross-Validation: Here we split the data into K roughly equal-sized parts, or “folds.” Typically, K is a number like 5 or 10, but it can vary based on the requirements. Each fold takes a turn as the testing set while the remaining folds form the training set.

Then, in each iteration, we train the model on the data in the training folds and use it to make predictions on the data in the held-out testing fold. We use a performance metric such as MSE (mean squared error) to measure how well the model performed on the testing fold for that specific iteration, and we average the metric across all K iterations to get the overall score.

Mean Squared Error (MSE): Mean squared error is a commonly used performance metric in statistics and machine learning. It measures the average squared difference between the values predicted by a model and the actual values in a dataset:

MSE = (1/n) * Σ (Yi − Ŷi)²

where Yi are the actual values, Ŷi are the predicted values, and n is the number of observations. The lower the MSE, the closer the model's predictions are to the actual values.
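To make this concrete, here is a minimal sketch of K-fold cross-validation scored with MSE, using scikit-learn; the arrays X and y are synthetic placeholders standing in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic placeholder data: 100 observations, 2 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=100)

# 5-fold cross-validation: each fold serves as the test set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mses = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])       # train on the other 4 folds
    preds = model.predict(X[test_idx])          # predict on the held-out fold
    fold_mses.append(mean_squared_error(y[test_idx], preds))

print("MSE per fold:", np.round(fold_mses, 3))
print("Average cross-validated MSE:", round(np.mean(fold_mses), 3))
```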

Linear Regression with Two Predictor Variables

In this type of regression, we use two predictor variables, X1 and X2, to predict an outcome variable, Y. We are interested not only in their individual effects but also in how they interact (X1*X2); quadratic effects could be added in the same way. This allows us to build a more flexible model that considers how these variables work together to produce the outcome Y.

Y = β0 + β1*X1 + β2*X2 + β3*X1*X2 + ε

where Y is the value we are trying to predict, X1 and X2 are the two predictor variables, β0 is the intercept term, β1 and β2 are the coefficients for the linear effects of X1 and X2, β3 is the coefficient for the interaction term (X1*X2), and ε is the error term.
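As a sketch, one way to fit this model is with the statsmodels formula API; the DataFrame below and its columns X1, X2, and Y are illustrative stand-ins, not the actual data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data generated to follow the model above.
rng = np.random.default_rng(1)
df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
df["Y"] = (1.0 + 0.5 * df["X1"] - 0.8 * df["X2"]
           + 0.3 * df["X1"] * df["X2"]
           + rng.normal(scale=0.5, size=200))

# In a formula, 'X1 * X2' expands to X1 + X2 + X1:X2 (the interaction),
# matching Y = β0 + β1*X1 + β2*X2 + β3*X1*X2 + ε.
model = smf.ols("Y ~ X1 * X2", data=df).fit()
print(model.params)   # estimates of β0, β1, β2, β3
```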

Overfitting: Overfitting happens when our model works really well on the data it was trained on (the data it knows) but does not perform well on new, unseen data.

Cross-Validation:

Cross-validation is an excellent way to detect overfitting. Imagine that we have a big dataset and divide it into smaller chunks; we then train and test the model multiple times, using a different chunk for testing each time. This lets us see how well the model performs on parts of the data it has not been trained on, rather than just on the data it already knows.
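To illustrate, the sketch below fits a deliberately over-flexible model: its training MSE looks excellent, but its cross-validated MSE is far worse, which is the signature of overfitting. The data and the degree-15 polynomial are chosen purely for demonstration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Small noisy dataset; a degree-15 polynomial is far too flexible for it.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))
y = X[:, 0] ** 2 + rng.normal(scale=2.0, size=30)

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X, y)

train_mse = mean_squared_error(y, model.predict(X))
cv_mse = -cross_val_score(model, X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()

# The gap between the two numbers is the overfitting showing itself.
print(f"Training MSE: {train_mse:.2f}   Cross-validated MSE: {cv_mse:.2f}")
```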


P-Value

The p-value is the probability of observing a test statistic at least as extreme as the one calculated from the data, assuming the null hypothesis is true.

Null Hypothesis (H0): In statistical hypothesis testing, we start with a null hypothesis, a statement that there is no effect or no difference. It serves as the default assumption.

Alternative Hypothesis (Ha): This is the opposite of the null hypothesis and represents what we are trying to prove. It suggests that there is a statistically significant effect or difference.

Test Statistic: To evaluate the null hypothesis, we calculate a test statistic from the data. Which statistic we use depends on the specific test we are performing.

Probability Distribution: We compare the test statistic to a probability distribution appropriate for the data and the test; the p-value is the probability, under that distribution, of a value at least as extreme as the one we observed.

Example:

For example, imagine we have a magical coin and we want to check whether it is fair or biased towards always landing on heads.

The null hypothesis is that the coin is fair, meaning it has an equal chance of landing on heads or tails each time we flip it.

The alternative hypothesis is that the coin is not fair and is biased towards landing on heads. This is what we want to find out.

Now, flip the coin 10 times and keep track of how many times it lands on heads. Let's say it lands on heads 8 times out of 10 flips.

The p-value is like a special number that helps us decide whether the coin is fair based on the result we got (8 heads out of 10 flips).

If the coin is fair, we expect it to land on heads about 5 times out of 10 flips, because each flip is a 50-50 chance. So we use the p-value to see how likely it is to get a result as extreme as 8 heads by random chance if the coin is actually fair.

If the p-value is very low (say, less than 0.05), it means it is very unlikely to get 8 or more heads by chance if the coin is fair, so we have evidence that the coin is biased.

But if the p-value is high (say, more than 0.05), it means it is quite likely to get 8 heads by chance even if the coin is fair. We can then say, “This result could happen by luck, so I can't be sure whether the coin is biased or not.”
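As a sketch, this p-value can be computed with SciPy's exact binomial test, assuming the one-sided “biased towards heads” alternative from the example:

```python
from scipy.stats import binomtest

# 8 heads in 10 flips; null hypothesis: P(heads) = 0.5;
# alternative: the coin is biased towards heads (one-sided test).
result = binomtest(k=8, n=10, p=0.5, alternative="greater")

# P(X >= 8) for X ~ Binomial(10, 0.5) is (45 + 10 + 1) / 1024 ≈ 0.0547,
# just above the usual 0.05 cutoff, so 8 heads alone is weak evidence.
print(f"p-value: {result.pvalue:.4f}")
```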

So, the p-value gives us a way to measure how sure we can be about our guess (the null hypothesis) based on the data we collected.

Linear Regression

I learnt about linear regression, a statistical approach that allows us to study the relationship between two continuous variables.

Mathematically, we can write the expression for linear regression as follows:

Y = α + βX

where Y is the dependent variable, X is the independent variable, α is the intercept, and β is the slope.
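A minimal sketch of estimating α and β by least squares with NumPy; the data here is made up purely for illustration.

```python
import numpy as np

# Made-up data following Y = 2 + 0.5*X plus noise.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=50)
Y = 2 + 0.5 * X + rng.normal(scale=0.5, size=50)

# np.polyfit with degree 1 returns the least-squares slope and intercept.
beta, alpha = np.polyfit(X, Y, deg=1)
print(f"alpha (intercept): {alpha:.3f}, beta (slope): {beta:.3f}")
```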

As per the diabetes dataset provided, the variables included in this particular dataset are %Obesity and %Inactivity.

In terms of the equation above, X represents %Obesity and Y represents %Inactivity.

Linear regression helps us work towards predicting diabetes by modeling the relationship between %Obesity and %Inactivity in a linear way.

As the first step, we plot all the points in the dataset provided; this helps us analyze and understand the data efficiently before the statistical analysis.
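Here is a sketch of that first step with pandas and matplotlib; the file name diabetes.csv and the column names %Obesity and %Inactivity are assumptions standing in for the actual dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("diabetes.csv")
x = df["%Obesity"]
y = df["%Inactivity"]

# Scatter plot of the raw points with the least-squares line overlaid.
slope, intercept = np.polyfit(x, y, deg=1)
xs = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y, alpha=0.6, label="data points")
plt.plot(xs, intercept + slope * xs, color="red", label="fitted line")
plt.xlabel("%Obesity")
plt.ylabel("%Inactivity")
plt.title("%Inactivity vs %Obesity")
plt.legend()
plt.show()
```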