Introduction to Pandas and ML
Here is how Mort started on this assignment, by asking ChatGPT: “Regarding Python Pandas, what are some data sets that would be good for learning Pandas?”
-
There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop.
Here are some suggestions…
-
Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.
-
Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.
-
NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.
-
Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.
-
COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.
-
World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.
-
-
These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.
Follow-up question: “Where can I find the Titanic data set?”
-
The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset…
-
Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.
-
UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.
-
Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:
-
Titanic Libraries
Using the Titanic dataset will require importing data.
import seaborn as sns
titanic_data = sns.load_dataset('titanic')
# Install the required packages; comment these lines out once the packages are installed
!pip install seaborn
!pip install pandas
!pip install scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Collecting seaborn
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/lib/python3/dist-packages (from seaborn) (3.5.1)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /home/iwu88/.local/lib/python3.10/site-packages (from seaborn) (1.25.2)
Requirement already satisfied: pandas>=1.2 in /home/iwu88/.local/lib/python3.10/site-packages (from seaborn) (2.1.0)
Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas>=1.2->seaborn) (2022.1)
Requirement already satisfied: tzdata>=2022.1 in /home/iwu88/.local/lib/python3.10/site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/iwu88/.local/lib/python3.10/site-packages (from pandas>=1.2->seaborn) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas>=1.2->seaborn) (1.16.0)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in /home/iwu88/.local/lib/python3.10/site-packages (2.1.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/iwu88/.local/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in /home/iwu88/.local/lib/python3.10/site-packages (from pandas) (2023.3)
Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas) (2022.1)
Requirement already satisfied: numpy>=1.22.4 in /home/iwu88/.local/lib/python3.10/site-packages (from pandas) (1.25.2)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Requirement already satisfied: numpy<2.0,>=1.19.5 in /home/iwu88/.local/lib/python3.10/site-packages (from scikit-learn) (1.25.2)
Requirement already satisfied: joblib>=1.2.0 in /home/iwu88/.local/lib/python3.10/site-packages (from scikit-learn) (1.3.2)
Collecting threadpoolctl>=2.0.0
Downloading threadpoolctl-3.3.0-py3-none-any.whl (17 kB)
Requirement already satisfied: scipy>=1.6.0 in /usr/lib/python3/dist-packages (from scikit-learn) (1.8.0)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.4.1.post1 threadpoolctl-3.3.0
Titanic Data
Look at a sample of data.
import seaborn as sns
# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')
print("Titanic Data")
print(titanic_data.columns) # titanic data set
display(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town', 'alone']]) # look at selected columns
/usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.25.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Titanic Data
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
'alive', 'alone'],
dtype='object')
index | survived | pclass | sex | age | sibsp | parch | class | fare | embark_town | alone |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | Third | 7.2500 | Southampton | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | First | 71.2833 | Cherbourg | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | Third | 7.9250 | Southampton | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | First | 53.1000 | Southampton | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | Third | 8.0500 | Southampton | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | male | 27.0 | 0 | 0 | Second | 13.0000 | Southampton | True |
887 | 1 | 1 | female | 19.0 | 0 | 0 | First | 30.0000 | Southampton | True |
888 | 0 | 3 | female | NaN | 1 | 2 | Third | 23.4500 | Southampton | False |
889 | 1 | 1 | male | 26.0 | 0 | 0 | First | 30.0000 | Cherbourg | True |
890 | 0 | 3 | male | 32.0 | 0 | 0 | Third | 7.7500 | Queenstown | True |
891 rows × 10 columns
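Beyond displaying selected columns, a few other pandas calls are useful for a first look at a data set. The following is a minimal sketch; the small `sample` frame is made up for illustration (it only mimics a few Titanic columns) so the code runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics a few Titanic columns
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.2833, 7.925, 8.05],
})

print(sample.head(2))          # first two rows
print(sample.dtypes)           # data type of each column
missing = sample.isna().sum()  # number of missing values per column
print(missing)
print(sample.describe())       # summary statistics for the numeric columns
```

Counting missing values early shows which columns will lose rows when `dropna` is applied later.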
Clean Titanic Data
Preparing data for analysis is called ‘cleaning’ the data.
Most analysis, including Machine Learning, requires data to be in a standardized format…
- All data needs to be numeric
- Some data is removed, as it is not usable in the study
- Sex and alone are changed to binary (1 or 0)
- The embarked data is split into multiple columns, one per port
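import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Preprocess the data
# Note: td is a second name for titanic_data, not a copy, so the in-place
# operations below also change titanic_data
td = titanic_data
# Drop columns that repeat other columns or have many missing values
td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True) # drop rows with at least one missing value, after dropping the unused columns
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0) # 1 = male, 0 = female
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0) # 1 = alone, 0 = not alone
# Encode categorical variables: one 0/1 column per port of embarkation
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
# Caution: pd.DataFrame(onehot) has a fresh 0..n-1 index, and this assignment
# aligns on index labels rather than row order. Rows of td whose label has no
# match become NaN and are removed by the dropna below; passing index=td.index
# to pd.DataFrame would keep every row.
td[cols] = pd.DataFrame(onehot)
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True) # drop rows with at least one missing value, after preparing the data
print(td.columns)
display(td)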
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'alone',
'embarked_C', 'embarked_Q', 'embarked_S'],
dtype='object')
index | survived | pclass | sex | age | sibsp | parch | fare | alone | embarked_C | embarked_Q | embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 0.0 | 0.0 | 1.0 |
1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 1.0 | 0.0 | 0.0 |
2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 1 | 0.0 | 0.0 | 1.0 |
3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 0 | 0.0 | 0.0 | 1.0 |
4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 1 | 0.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
705 | 0 | 2 | 1 | 39.0 | 0 | 0 | 26.0000 | 1 | 0.0 | 0.0 | 1.0 |
706 | 1 | 2 | 0 | 45.0 | 0 | 0 | 13.5000 | 1 | 0.0 | 0.0 | 1.0 |
707 | 1 | 1 | 1 | 42.0 | 0 | 0 | 26.2875 | 1 | 0.0 | 1.0 | 0.0 |
708 | 1 | 1 | 0 | 22.0 | 0 | 0 | 151.5500 | 1 | 0.0 | 0.0 | 1.0 |
710 | 1 | 1 | 0 | 24.0 | 0 | 0 | 49.5042 | 1 | 1.0 | 0.0 | 0.0 |
564 rows × 11 columns
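One detail of the encoding step deserves a closer look: when a pandas object is assigned to DataFrame columns, pandas matches rows by index label, not by position. The sketch below uses a made-up three-row frame (not the Titanic data) whose index has gaps, as happens after `dropna`, to show the effect. It also shows `pd.get_dummies` as one possible alternative that builds the one-hot columns on the original index. This is an illustration under those assumptions, not the notebook's own method.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after dropna()
toy = pd.DataFrame({"embarked": ["S", "C", "Q"]}, index=[0, 2, 5])

# A freshly built DataFrame gets the default index 0, 1, 2
fresh = pd.DataFrame({"flag": [10, 20, 30]})

misaligned = toy.copy()
misaligned["flag"] = fresh["flag"]          # matched on labels 0, 2, 5
print(misaligned)                           # label 2 gets 30, label 5 gets NaN

aligned = toy.copy()
aligned["flag"] = fresh["flag"].to_numpy()  # plain array: assigned by position
print(aligned)

# get_dummies builds the one-hot columns directly on toy's own index
dummies = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=float)
encoded = pd.concat([toy.drop(columns="embarked"), dummies], axis=1)
print(encoded)
```

In the misaligned case a later `dropna` would silently discard the row labeled 5, which is the kind of row loss worth checking for whenever the row count shrinks unexpectedly.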
Train Titanic Data
Before training a model, summary statistics make the data easier to analyze and draw conclusions from.
What conclusions can you make using min, max, and mean statistics about the following…
- Given that 1 is male and 0 is female, which gender is more likely to survive?
- Can you make any conclusions on fare?
- Can you make any conclusions on being alone?
Median Values
print(td.median()) # td and titanic_data refer to the same cleaned DataFrame
survived 0.0
pclass 2.0
sex 1.0
age 28.0
sibsp 0.0
parch 0.0
fare 16.1
alone 1.0
embarked_C 0.0
embarked_Q 0.0
embarked_S 1.0
dtype: float64
Perished Mean/Average
print(td.query("survived == 0").mean())
survived 0.000000
pclass 2.464072
sex 0.844311
age 31.073353
sibsp 0.562874
parch 0.398204
fare 24.835902
alone 0.616766
embarked_C 0.185629
embarked_Q 0.038922
embarked_S 0.775449
dtype: float64
Survived Mean/Average
print(td.query("survived == 1").mean())
survived 1.000000
pclass 1.878261
sex 0.326087
age 28.481522
sibsp 0.504348
parch 0.508696
fare 50.188806
alone 0.456522
embarked_C 0.152174
embarked_Q 0.034783
embarked_S 0.813043
dtype: float64
Survived Max and Min Stats
print("maximums for survivors")
print(td.query("survived == 1").max())
print()
print("minimums for survivors")
print(td.query("survived == 1").min())
maximums for survivors
survived 1.0000
pclass 3.0000
sex 1.0000
age 80.0000
sibsp 4.0000
parch 5.0000
fare 512.3292
alone 1.0000
embarked_C 1.0000
embarked_Q 1.0000
embarked_S 1.0000
dtype: float64
minimums for survivors
survived 1.00
pclass 1.00
sex 0.00
age 0.75
sibsp 0.00
parch 0.00
fare 0.00
alone 0.00
embarked_C 0.00
embarked_Q 0.00
embarked_S 0.00
dtype: float64
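The separate `query(...)` calls above can also be combined into a single table with `groupby`, which places the two outcomes side by side. A minimal sketch, using a made-up six-row frame in place of `td` so that it runs on its own:

```python
import pandas as pd

# Hypothetical cleaned data: sex is 1 for male, 0 for female
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "sex":      [1, 1, 0, 0, 0, 1],
    "fare":     [7.25, 8.05, 10.5, 71.28, 53.10, 30.0],
    "alone":    [1, 1, 0, 0, 0, 1],
})

# One row per outcome; mean, min and max for every feature
summary = toy.groupby("survived").agg(["mean", "min", "max"])
print(summary)

# The mean of a 0/1 column is a proportion: here, the share of males
male_share = toy.groupby("survived")["sex"].mean()
print(male_share)
```

Reading the mean of a binary column as a proportion is what makes the `sex` and `alone` rows in the outputs above interpretable.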
Machine Learning
Scikit-learn is a powerful Python library for machine learning, offering tools for classification, regression, clustering, and dimensionality reduction.
-
The Titanic dataset is a classic for data analysis and machine learning. We’ll use machine learning techniques like Decision Trees and Logistic Regression to predict passenger survival.
-
Decision Trees are a type of model used for both classification and regression. They work by creating a tree-like model of decisions based on the features. For example, in the context of the Titanic dataset, a Decision Tree might make decisions based on features like ‘age’, ‘sex’, and ‘fare’ to predict whether a passenger survived. The tree might first split by ‘sex’, then for each sex, split by ‘age’, and so on, creating a tree of decisions.
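To make the idea of a tree of decisions concrete, the following sketch fits a small tree and prints its rules as text. The eight-row data set is made up for illustration (it is not the Titanic data), and `max_depth=2` is an arbitrary choice to keep the printout short.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: sex is 1 for male, 0 for female
X_toy = pd.DataFrame({
    "sex": [1, 1, 1, 1, 0, 0, 0, 0],
    "age": [30, 40, 5, 50, 25, 35, 8, 60],
})
y_toy = [0, 0, 1, 0, 1, 1, 1, 1]  # 1 = survived

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Print the learned decisions as indented if/else rules
rules = export_text(tree, feature_names=list(X_toy.columns))
print(rules)
```

On this toy data the tree splits on `sex` first and then on `age` within one branch, which mirrors the description above.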
-
Logistic Regression is a statistical model used in machine learning for binary classification problems. It models the probabilities of the default class (e.g., the probability of a passenger surviving, in the context of the Titanic dataset).
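Under the hood, logistic regression computes a weighted sum of the features and passes it through the logistic (sigmoid) function to get a probability. The sketch below shows that arithmetic with plain Python; the intercept and coefficients are hypothetical values chosen for illustration, not values learned from the Titanic data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients, for illustration only
intercept = 2.0
coef = {"sex": -2.5, "pclass": -1.0}

passenger = {"sex": 0, "pclass": 1}  # a female in first class
z = intercept + sum(coef[name] * passenger[name] for name in coef)
p_survive = sigmoid(z)
print(f"z = {z:.2f}, P(survived) = {p_survive:.2%}")
```

A negative coefficient lowers the weighted sum, and therefore the predicted probability, as that feature's value increases.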
-
After training our models, we’ll evaluate their performance using accuracy, the percentage of correct predictions on unseen data.
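Accuracy is simple enough to compute by hand, which helps clarify what `accuracy_score` reports. A minimal sketch with made-up labels and predictions:

```python
# Hypothetical true labels and predictions for five held-out passengers
y_true = [1, 0, 0, 1, 0]
y_hat = [1, 0, 1, 1, 0]

# Count the positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)
print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy:.2%}")
```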
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Build distinct data frames on survived column
X = td.drop('survived', axis=1) # all except 'survived'
y = td['survived'] # only 'survived'
# Split the data at random into 70% train and 30% test, with a fixed random state (42) so the split is reproducible. No stratify argument is passed, so the proportion of survivors may differ slightly between the two sets.
# The number 42 is often used in examples and tutorials because of its cultural significance in fields like science fiction (it's the "Answer to the Ultimate Question of Life, The Universe, and Everything" in The Hitchhiker's Guide to the Galaxy by Douglas Adams). But in practice, the actual value doesn't matter; what's important is that it's set to a consistent value.
# X_train is the DataFrame containing the features for the training set.
# X_test is the DataFrame containing the features for the test set.
# y_train is the 'survived' status for each passenger in the training set, corresponding to the X_train data.
# y_test is the 'survived' status for each passenger in the test set, corresponding to the X_test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy: {:.2%}'.format(accuracy))
# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy: {:.2%}'.format(accuracy))
DecisionTreeClassifier Accuracy: 77.06%
LogisticRegression Accuracy: 78.82%
/home/iwu88/.local/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
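The ConvergenceWarning above suggests two remedies: raise `max_iter` or scale the data. The sketch below shows one way to do both with a scikit-learn pipeline, and also passes `stratify` so that both splits keep the same class proportions. It uses synthetic data (not the Titanic set), and the value `max_iter=1000` is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=200)
fare = rng.uniform(0, 500, size=200)
X_syn = np.column_stack([age, fare])
y_syn = (fare > 250).astype(int)  # label depends on fare only

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# Scale the features, then fit; max_iter raises the iteration limit
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print("Accuracy: {:.2%}".format(model.score(X_te, y_te)))
```

Putting the scaler inside the pipeline means it is fitted on the training split only and then reused for the test split and for any new passenger.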
Predicting Survival
So, now we are ready to play the game… “Would I have survived the Titanic?”
Insert your own data in the code. Look at your analysis and consider how you would travel today.
- Data description:
- pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- name - Name
- sex - male or female
- age - age in years
- sibsp - number of Siblings/Spouses Aboard
- parch - number of Parents/Children Aboard
- fare - passenger fare 0 to 512
- embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- alone - boolean True or False
import numpy as np
# Logistic regression model is used to predict the probability
# Define a new passenger; replace the values below with your own
passenger = pd.DataFrame({
    'name': [''],
    'pclass': [0], # placeholder; valid classes are 1, 2, or 3
    'sex': ['female'],
    'age': [0], # placeholder; age in years
    'sibsp': [0],
    'parch': [0],
    'fare': [512],
    'embarked': ['S'],
    'alone': [False]
})
display(passenger)
new_passenger = passenger.copy()
# Preprocess the new passenger data
new_passenger['sex'] = new_passenger['sex'].apply(lambda x: 1 if x == 'male' else 0)
new_passenger['alone'] = new_passenger['alone'].apply(lambda x: 1 if x == True else 0)
# Encode 'embarked' variable
onehot = enc.transform(new_passenger[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
new_passenger[cols] = pd.DataFrame(onehot, index=new_passenger.index)
new_passenger.drop(['name'], axis=1, inplace=True)
new_passenger.drop(['embarked'], axis=1, inplace=True)
display(new_passenger)
# Predict the survival probability for the new passenger
dead_proba, alive_proba = np.squeeze(logreg.predict_proba(new_passenger))
# Print the survival probability
print('Death probability: {:.2%}'.format(dead_proba))
print('Survival probability: {:.2%}'.format(alive_proba))
index | name | pclass | sex | age | sibsp | parch | fare | embarked | alone |
---|---|---|---|---|---|---|---|---|---|
0 | | 0 | female | 0 | 0 | 0 | 512 | S | False |
index | pclass | sex | age | sibsp | parch | fare | alone | embarked_C | embarked_Q | embarked_S |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 512 | 0 | 0.0 | 0.0 | 1.0 |
Death probability: 0.34%
Survival probability: 99.66%
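The code above unpacks `predict_proba` into a death probability followed by a survival probability. That order follows the model's `classes_` attribute: one column per class, in the order listed there. A minimal sketch on synthetic one-feature data (not the Titanic set) shows the shape and ordering:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data: larger values belong to class 1
X_syn = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_syn = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_syn, y_syn)
print(clf.classes_)  # column order used by predict_proba

proba = clf.predict_proba(np.array([[4.5]]))
p_class0, p_class1 = proba[0]  # one row per sample, one column per class
print(f"P(class 0) = {p_class0:.2%}, P(class 1) = {p_class1:.2%}")
```

The two probabilities in each row always sum to 1, so either one determines the other.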
Improve your chances
Is there anything you could do to improve your chances?
# Decision tree model is used to determine the importance of each feature
importances = dt.feature_importances_
# X.columns holds the feature names in the order the tree was trained on
for feature, importance in zip(X.columns, importances):
    print(f'The importance of {feature} is: {importance}')
The importance of pclass is: 0.15113965322843248
The importance of sex is: 0.27345943069742495
The importance of age is: 0.24268504798937082
The importance of sibsp is: 0.04302116883813218
The importance of parch is: 0.013914855333419263
The importance of fare is: 0.2525437299188138
The importance of alone is: 0.0
The importance of embarked_C is: 0.011151798192078404
The importance of embarked_Q is: 0.0
The importance of embarked_S is: 0.012084315802328143
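The importances are easier to compare when they are sorted. In scikit-learn's decision trees they are normalized to sum to 1, and a value of 0.0 means the tree never split on that feature. A minimal sketch with a made-up data set, where the `noise` column is a hypothetical feature that carries no information:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; 'noise' is constant, so it cannot help any split
X_toy = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 0, 0, 0, 0],
    "noise": [3, 3, 3, 3, 3, 3, 3, 3],
})
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort, largest first
ranked = pd.Series(tree.feature_importances_, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

Importances describe what this particular tree relied on; they do not by themselves say which way a feature pushes the prediction.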