Hey Everyone,
I recently dug into the data from the CrossFit Open, focusing on the performances of the top 12,500 female athletes. Here's a breakdown of what I found:
Registered by continent:
https://preview.redd.it/ie8ixv1b5ip...bp&s=183c33dc381a822920dc27e15e03b8704cb7a2f6
Registered by country:
https://preview.redd.it/6o2temje5ip...bp&s=00135b4f3b3868367638aee3078649b72f2b2672
Ranking according to score by continent:
https://preview.redd.it/n5goshgtcip...bp&s=b0dc55aa1058a9a1d60cf034aa6dba1b4e0701d0
Data Distribution: Before analyzing the relationships between the variables and their significance, it's important to understand how the data is distributed. For the three WODs, the unit of measurement is repetitions per second (rep/s). Below are boxplots showing the minimum (Min), first quartile (Q1), median (Med), third quartile (Q3), and maximum (Max) for each of the key variables:
https://preview.redd.it/otvkddba7ip...bp&s=97aa87aee17df5018e6dc9ca7b323f3195f73031
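For anyone who wants to reproduce the numbers behind boxplots like these, here is a minimal sketch. It assumes each result is available as a (reps, seconds) pair; the sample values, `reps_per_second`, and `five_number_summary` are hypothetical names and data, not taken from the actual dataset.

```python
from statistics import quantiles


def reps_per_second(reps, seconds):
    """Convert a raw result into a rate (assumed conversion: reps / time)."""
    return reps / seconds


def five_number_summary(values):
    """Return the Min, Q1, Med, Q3 and Max that a boxplot displays."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "med": med, "q3": q3, "max": max(values)}


# Hypothetical (reps, seconds) results, for illustration only.
results = [(180, 900), (150, 900), (120, 900), (90, 900), (60, 900)]
rates = [reps_per_second(reps, seconds) for reps, seconds in results]
summary = five_number_summary(rates)

for name, value in summary.items():
    print(f"{name}: {value:.4f} rep/s")
```

The `inclusive` method treats the sample as the whole population; plotting libraries may use a slightly different quartile convention, so small differences from a published boxplot are expected.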
Correlation Analysis: Now, let's examine the correlations between these variables:
https://preview.redd.it/t0ctmn4k7ip...bp&s=c7d820170a3c92090c82e6a58aee96ba6fc65653
Wod 24.1: Moderate positive correlation with Wod 24.2 (r = 0.54) and strong positive correlation with Wod 24.3 (r = 0.67). Weak negative correlations with height (r = -0.12), weight (r = -0.05), and age (r = -0.16).
Wod 24.2: Moderate positive correlation with Wod 24.1 (r = 0.54) and Wod 24.3 (r = 0.53). Weak positive correlations with height (r = 0.19) and weight (r = 0.25). Weak negative correlation with age (r = -0.16).
Wod 24.3: Strong positive correlation with Wod 24.1 (r = 0.67) and moderate positive correlation with Wod 24.2 (r = 0.53). Weak negative correlations with height (r = -0.07) and age (r = -0.17), and a negligible positive correlation with weight (r = 0.04).
Significance Analysis:
To assess the importance of different variables in predicting the overall score, a multiple linear regression model was employed. The model's performance was evaluated using the following metrics:
Coefficient of Determination (R^2): 80%
P-values for Regression Coefficients:
Age: The p-value associated with age is extremely small (1.74e-29), indicating that age is a highly significant predictor in the model.
Height [in]: The p-value for height is 0.857, suggesting that this variable is not significant in predicting the outcome.
Weight [lb]: The p-value associated with weight is 1.42e-06, indicating statistical significance.
Wod 1, Wod 2, Wod 3: The p-values for the WOD results are all reported as 0.0, meaning they are too small to display at the reported precision. This indicates strong statistical significance for these variables.
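To make the regression step concrete, here is a minimal sketch of ordinary least squares with R^2 and per-coefficient p-values, using only the standard library. Everything in it is an assumption for illustration: the data is synthetic, the helper names (`invert`, `ols`) are hypothetical, and the p-values use a normal approximation to the t distribution, which is only reasonable for large samples. A statistics package would normally do this for you.

```python
import math
import random
from statistics import NormalDist


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(X, y):
    """Fit y = b0 + b1*x1 + ...; return (coefficients, p-values, R^2)."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    y_mean = sum(y) / n
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    sigma2 = ss_res / (n - k)  # residual variance estimate
    std_err = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    # Two-sided p-values; normal approximation to the t distribution (large n).
    p_values = [2 * (1 - NormalDist().cdf(abs(b / se))) for b, se in zip(beta, std_err)]
    return beta, p_values, 1 - ss_res / ss_tot


# Synthetic data: the first predictor drives y, the second is pure noise.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
y = [2.0 + 3.0 * x1 + random.gauss(0, 1) for x1, _ in X]

beta, p_values, r2 = ols(X, y)
print("coefficients:", [round(b, 3) for b in beta])
print("p-values:", p_values)
print("R^2:", round(r2, 3))
```

A very strong predictor can produce a p-value that underflows to exactly 0.0 in floating point, which is one way a regression summary ends up showing 0.0.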
Feature Importance According to the Decision Tree (R^2 = 99%):
Age, Height [in], Weight [lb]: These variables have low importance according to the decision tree, with values around 0.0006.
Wod 1, Wod 2, Wod 3: The results of the WODs are identified as the most important features in predicting the outcome, with Wod 3 having the highest importance value of approximately 0.573.
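For readers unfamiliar with tree-based feature importance, here is a small sketch of the usual impurity-based idea: each split's reduction in squared error is credited to the feature it splits on, and the totals are normalized to sum to 1. This is not the code used for the analysis (the post does not say which library or settings were used); the data is synthetic and the helpers `sse`, `best_split`, and `grow` are hypothetical names.

```python
import random


def sse(ys):
    """Sum of squared deviations from the mean (node impurity times node size)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X, y):
    """Return (gain, feature, threshold) for the split that most reduces SSE."""
    parent = sse(y)
    best = (0.0, None, None)
    for f in range(len(X[0])):
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            gain = parent - sse(left) - sse(right)
            if gain > best[0]:
                best = (gain, f, t)
    return best


def grow(X, y, depth, importance):
    """Recursively split, adding each split's SSE reduction to its feature."""
    if depth == 0 or len(y) < 4:
        return
    gain, f, t = best_split(X, y)
    if f is None:
        return
    importance[f] += gain
    left = [(row, yi) for row, yi in zip(X, y) if row[f] <= t]
    right = [(row, yi) for row, yi in zip(X, y) if row[f] > t]
    for part in (left, right):
        grow([row for row, _ in part], [yi for _, yi in part], depth - 1, importance)


# Synthetic data: feature 0 dominates, feature 1 matters a little, feature 2 is noise.
random.seed(1)
X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [10.0 * row[0] + 1.0 * row[1] + random.gauss(0, 0.1) for row in X]

raw = [0.0, 0.0, 0.0]
grow(X, y, depth=3, importance=raw)
total = sum(raw)
importances = [v / total for v in raw]  # normalized so the values sum to 1
print([round(v, 4) for v in importances])
```

Because importance is relative, a feature that carries real but small signal can still end up with a value close to zero when other features explain most of the variance.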