Capstone-predict-diabetes - Jupyter Notebook

Course

python programming (cs8151)

University

Anna University

Academic year: 2022/2023
Healthcare

Problem Statement

NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases) research creates knowledge about and treatments for the most chronic, costly, and consequential diseases. The dataset used in this project is originally from NIDDK. The objective is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Build a model to accurately predict whether the patients in the dataset have diabetes or not.

Dataset Description

The dataset consists of several medical predictor variables and one target variable (Outcome). Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and more.

Variables Description

Variable                  Description
Pregnancies               Number of times pregnant
Glucose                   Plasma glucose concentration in an oral glucose tolerance test
BloodPressure             Diastolic blood pressure (mm Hg)
SkinThickness             Triceps skinfold thickness (mm)
Insulin                   Two hour serum insulin
BMI                       Body Mass Index
DiabetesPedigreeFunction  Diabetes pedigree function
Age                       Age in years
Outcome                   Class variable (either 0 or 1). 268 of 768 values are 1, and the others are 0

Data Exploration:

  1. Perform descriptive analysis. Understand the variables and their corresponding values. In the columns below, a value of zero does not make sense and thus indicates a missing value:
  • Glucose
  • BloodPressure
  • SkinThickness
  • Insulin
  • BMI
  2. Visually explore these variables using histograms. Treat the missing values accordingly.
  3. There are integer and float data type variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables.
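The steps above can be sketched on a tiny made-up sample. The values below are illustrative only (not rows from the real dataset), and the names `sample`, `zero_as_missing` and `cleaned` are assumptions introduced for this example. The sketch counts the zeros per column, marks them as missing, and computes the per-dtype variable counts that a frequency plot would show.

```python
import numpy as np
import pandas as pd

# Made-up sample that only mimics the dataset's column names.
sample = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin":       [0, 0, 0, 94],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
    "Outcome":       [1, 0, 1, 0],
})

# Columns in which a zero is physically implausible and stands for "missing".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Number of zeros in each of those columns.
zero_counts = (sample[zero_as_missing] == 0).sum()

# Replace the zeros with NaN so that mean(), describe(), etc. skip them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count of variables per data type (the numbers behind a dtype frequency plot).
dtype_counts = sample.dtypes.astype(str).value_counts()

print(zero_counts)
print(dtype_counts)
```

Whether to drop, impute or keep the marked values is a separate decision; the sketch only makes the missing values explicit.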

Data Exploration:

  1. Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.
  2. Create scatter charts between the pair of variables to understand the relationships. Describe your findings.
  3. Perform correlation analysis. Visually explore it using a heat map.
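As a quick worked example of the balance check, the class counts stated in the dataset description (268 of 768 outcomes are 1) can be turned into shares. The variable names are assumptions for illustration. The majority-class share is a useful baseline: a model that always predicts 0 already reaches that accuracy.

```python
# Class counts taken from the dataset description: 268 of 768 outcomes are 1.
counts = {0: 768 - 268, 1: 268}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"Outcome {label}: {counts[label]} rows ({share:.1%})")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

With roughly a 65/35 split the data is moderately imbalanced, so accuracy alone should be read next to sensitivity and specificity.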

Data Modeling:

  1. Devise strategies for model building. It is important to decide the right validation framework. Express your thought process.
  2. Apply an appropriate classification algorithm to build a model. Compare various models with the results from KNN algorithm.
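One possible validation framework is a stratified hold-out split with a KNN baseline. The sketch below is an assumption, not the notebook's own setup: it uses synthetic data from `make_classification` in place of the diabetes data, and the split ratio, `n_neighbors=5` and the scaling step are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 features, roughly a 65/35 class split.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN is distance based, so the features are scaled before fitting.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

test_accuracy = knn.score(X_test, y_test)
print(f"KNN baseline accuracy on the held-out part: {test_accuracy:.3f}")
```

Other classifiers can then be compared against this baseline on the same split.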

Data Modeling:

  1. Create a classification report by analyzing sensitivity, specificity, AUC (ROC curve), etc. Please be descriptive to explain what values of these parameters you have used.
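To make these metrics concrete, here is a minimal sketch on eight hand-made predictions. The labels, scores and the 0.5 threshold are assumptions chosen for illustration, not results from the notebook. Sensitivity is the share of actual positives that are detected, specificity is the share of actual negatives that are correctly rejected, and AUC summarizes the ranking quality over all thresholds.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hand-made example: true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.30, 0.60, 0.70, 0.90]

# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
auc = roc_auc_score(y_true, y_score)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} AUC={auc:.4f}")
```

Moving the threshold trades sensitivity against specificity, while the AUC stays the same because it depends only on the ordering of the scores.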

Data Reporting:

  1. Create a dashboard in Tableau by choosing appropriate chart types and metrics useful for the business. The dashboard must entail the following:

a. Pie chart to describe the diabetic or non-diabetic population

b. Scatter charts between relevant variables to analyze the relationships

c. Histogram or frequency charts to analyze the distribution of the data

d. Heatmap of correlation analysis among the relevant variables

e. Create bins of these age values: 20-25, 25-30, 30-35, etc. Analyze different variables for these age brackets using a bubble chart.
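The age brackets in item (e) could also be prepared in pandas before the data is loaded into Tableau. This is a sketch under assumptions: the ages are made up, the upper edge of 85 is an arbitrary choice, and each bracket is treated as left-closed, so an age of 25 falls into 25-30.

```python
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([21, 24, 25, 29, 33, 41, 50], name="Age")

# Bin edges 20, 25, ..., 85 and labels such as "20-25".
edges = list(range(20, 90, 5))
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False makes each bin include its lower edge and exclude its upper edge.
age_bracket = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "AgeBracket": age_bracket}))
```

The resulting column can then be used as the grouping dimension of the bubble chart.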

In [1]:

# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import style
import seaborn as sns

# list the input files
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/datasciencecapstonehealthcare/Health - Tableau
/kaggle/input/datasciencecapstonehealthcare/Capstone Project 2 - Health Ca re Diabetes
/kaggle/input/datasciencecapstonehealthcare/README
/kaggle/input/datasciencecapstonehealthcare/Capstone Project 2 - Health Ca re Diabetes
/kaggle/input/datasciencecapstonehealthcare/capstone_healthcare_diabetes sv

Here we need to predict the Outcome column: 1 indicates that the person has diabetes and 0 that the person does not.

In [7]:

#Get count of outcome column
diabetes_data.groupby('Outcome').size()

Out[7]:

Outcome
0    500
1    268
dtype: int64

In [8]:

#checking null value
diabetes_data.isnull().any()

Out[8]:

Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool

In [9]:

diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54 KB

In [10]:

#Now lets create a count (frequency) plot describing the data types and the count of variables

In [11]:

diabetes_data['Glucose'].value_counts().head(10)

Out[11]:

99     17
100    17
129    14
125    14
106    14
111    14
102    13
95     13
105    13
108    13
Name: Glucose, dtype: int64

In [12]:

diabetes_data['Glucose']

Out[12]:

0      148
1       85
2      183
3       89
4      137
      ...
763    101
764    122
765    121
766    126
767     93
Name: Glucose, Length: 768, dtype: int64

In [15]:

#Drawing BloodPressure histogram
plot_histogram(diabetes_data['BloodPressure'],'BloodPressure Histogram')

Instead of creating the histograms one by one, we can group the data by Outcome and draw the histograms of all columns at once.

In [16]:

diabetes_data.groupby('Outcome').hist(figsize=(14, 13))

Out[16]:

Outcome
0    [[AxesSubplot(0,0;0.215278x0...
1    [[AxesSubplot(0,0;0.215278x0...
dtype: object
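The definition of `plot_histogram` is not part of the text shown here. A hypothetical stand-in with the same two-argument call (values, title) might look like the sketch below; the `bins` parameter, the figure size and the axis labels are assumptions, and the series holds made-up values.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histogram(values, title, bins=20):
    """Hypothetical helper: draw one histogram and return its Axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(values, bins=bins, edgecolor="black")
    ax.set_title(title)
    # Use the Series name as the x label when one is available.
    ax.set_xlabel(getattr(values, "name", None) or "value")
    ax.set_ylabel("Frequency")
    return ax


# Made-up values, including a zero that stands for a missing reading.
blood_pressure = pd.Series([0, 64, 66, 72, 72, 80], name="BloodPressure")
ax = plot_histogram(blood_pressure, "BloodPressure Histogram")
```

In a histogram like this, zero-coded missing values show up as an isolated bar at the far left, which is how they are usually spotted.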

In [19]:

#Checking count of zeros in SkinThickness
get_zeros_outcome_count(diabetes_data,'SkinThickness')

Total No of zeros found in SkinThickness : 227
Outcome
0    139
1     88
Name: Age, dtype: int64

In [20]:

#Checking count of zeros in BMI
get_zeros_outcome_count(diabetes_data,'BMI')

Total No of zeros found in BMI : 11
Outcome
0    9
1    2
Name: Age, dtype: int64

In [21]:

#Checking count of zeros in Insulin
get_zeros_outcome_count(diabetes_data,'Insulin')

Total No of zeros found in Insulin : 374
Outcome
0    236
1    138
Name: Age, dtype: int64

The analysis above shows a large number of zeros in Insulin and SkinThickness. Removing those rows, or replacing the zeros with the mean, would distort the dataset, so these two columns are left unchanged. The rows with zeros in BloodPressure, BMI and Glucose, however, can be removed.

In [22]:

#Remove the rows where BloodPressure, BMI or Glucose is zero
diabetes_data_mod = diabetes_data[(diabetes_data.BloodPressure != 0) & (diabetes_data.BMI != 0) & (diabetes_data.Glucose != 0)]
print(diabetes_data_mod.shape)

(724, 9)
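The helper `get_zeros_outcome_count` is called above, but its definition is not included in the text. Judging only from the printed output (a total followed by per-Outcome counts labelled `Name: Age`), it might look roughly like this hypothetical reconstruction; the sample frame holds made-up values.

```python
import pandas as pd


def get_zeros_outcome_count(data, column_name):
    """Hypothetical reconstruction: count zero rows in a column, split by Outcome."""
    zero_rows = data[data[column_name] == 0]
    # Counting the Age column per Outcome gives the number of zero rows per class.
    per_outcome = zero_rows.groupby("Outcome")["Age"].count()
    print("Total No of zeros found in " + column_name + " : " + str(len(zero_rows)))
    print(per_outcome)
    return per_outcome


# Made-up sample with the columns the helper needs.
sample = pd.DataFrame({
    "Insulin": [0, 0, 94, 0, 168],
    "Age":     [50, 31, 21, 32, 33],
    "Outcome": [1, 0, 0, 1, 1],
})
result = get_zeros_outcome_count(sample, "Insulin")
```

Returning the per-class counts, in addition to printing them, is a choice made here so the result can be reused; the original helper may only print.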

In [23]:

#Now we will check the stats of data after removing BloodPressure, BMI and Glucose 0 rows
diabetes_data_mod.describe().transpose()

Out[23]:

                          count  mean  std  min  25%  50%  75%
Pregnancies                 724     3    3    0    1    3   6.
Glucose                     724   121   30   44   99  117  142.
BloodPressure               724    72   12   24   64   72   80.
SkinThickness               724    21   15    0    0   24   33.
Insulin                     724    84  117    0    0   48  130.
BMI                         724    32    6   18   27   32   36.
DiabetesPedigreeFunction    724     0    0    0    0    0    0.
Age                         724    33   11   21   24   29   41.
Outcome                     724     0    0    0    0    0    1.

Data Exploration:

  1. Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.
  2. Create scatter charts between the pair of variables to understand the relationships. Describe your findings.
  3. Perform correlation analysis. Visually explore it using a heat map.

In [24]:

#Lets create positive variable and store all 1 value Outcome data
Positive = diabetes_data_mod[diabetes_data_mod['Outcome'] == 1]
Positive.head(5)

Out[24]:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI  DiabetesPedigreeFunc
0            6      148             72             35        0   33  0.
2            8      183             64              0        0   23  0.
4            0      137             40             35      168   43  2.
6            3       78             50             32       88   31  0.
8            2      197             70             45      543   30  0.

In [27]:

create_scatter_plot(Positive['BloodPressure'],Positive['Glucose'],'BloodPressure','Gluco

We don't strictly need a scatter plot for the negative outcome, but I am creating it to verify that the values and points for both outcome values match what the sns scatterplot shows.

In [28]:

#Creating scatter plot for negative outcome
Negative = diabetes_data_mod[diabetes_data_mod['Outcome'] == 0]

In [29]:

create_scatter_plot(Negative['BloodPressure'],Negative['Glucose'],'BloodPressure','Gluco

In [30]:

g = sns.scatterplot(x="BloodPressure", y="Glucose", hue="Outcome", data=diabetes_data_mod);

Comparing the positive and negative scatter plots with the sns scatter plot shows that all the values match, so from now on I will create a common scatter plot for both outcomes.

In [31]:

B = sns.scatterplot(x="BMI", y="Insulin", hue="Outcome", data=diabetes_data_mod);

In [34]:

# create correlation heat map
plt.figure(figsize=(8, 8))
sns.heatmap(diabetes_data_mod.corr())

Out[34]:

<AxesSubplot:>

In [35]:

# gives correlation value
plt.figure(figsize=(10, 10))
sns.heatmap(diabetes_data_mod.corr(), annot=True, cmap='viridis')

Out[35]:

<AxesSubplot:>

Logistic Regression and model building

In [36]:

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes_data_mod[feature_names]
y = diabetes_data_mod.Outcome

In [41]:

#now check LR model score and accuracy score
print("LogisticRegression Score :{}".format(model_LR.score(X_train, y_train)))
y_pred = model_LR.predict(X_test)
scores = accuracy_score(y_test, y_pred)
print("LogisticRegression Accuracy Score :{}".format(scores))

LogisticRegression Score :0.
LogisticRegression Accuracy Score :0.

In [42]:

accuracyScores = []
modelScores = []
models = []
names = []
#Store algorithm into array to get score and accuracy
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('SVC', SVC()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))

In [43]:

#We fit each model in a loop and calculate the accuracy of the respective model
for name, model in models:
    model.fit(X_train, y_train)
    #Score is the accuracy on the training data
    modelScores.append(model.score(X_train, y_train))
    #Accuracy Score is the accuracy on the test data
    y_pred = model.predict(X_test)
    accuracyScores.append(accuracy_score(y_test, y_pred))
    names.append(name)

In [44]:

tr_split_data = pd.DataFrame({'Name': names, 'Score': modelScores, 'Accuracy Score': accuracyScores})
print(tr_split_data)

  Name  Score  Accuracy Score
0   LR      0              0.
1  SVC      0              0.
2  KNN      0              0.
3   DT      1              0.
4  GNB      0              0.
5   RF      1              0.
6   GB      0              0.
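The Score column is measured on the training data and the Accuracy Score column on the test data, so a training score of 1 (as shown for DT and RF) most likely means the model has memorized the training rows rather than learned to generalize. The sketch below illustrates the effect with an unrestricted decision tree on synthetic data; the data and all parameters are assumptions, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so that it cannot be learned perfectly.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A tree without a depth limit keeps splitting until every training row is classified correctly.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

For model comparison the test (or cross-validated) accuracy is the number to look at; the training score mainly signals how much a model overfits.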

In [45]:

#Lets draw graph to understand more.
plt.figure(figsize=(12, 7))
axis = sns.barplot(x='Name', y='Accuracy Score', data=tr_split_data)
axis.set(xlabel='Classifier Name', ylabel='Accuracy Score')
#Write the accuracy value above each bar
for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width() / 2, height + 0, '{:1}'.format(height), ha="center")

plt.show()

Now let's perform K-Fold Cross Validation with scikit-learn

We will move forward with K-fold cross validation, which gives a more reliable accuracy estimate than a single train/test split because every row is used for both training and testing. We will train the models using 10-fold cross validation and calculate the mean accuracy of each model. scikit-learn's cross_val_score handles the training and the accuracy calculation for each fold.

In [46]:

names = []
scores = []
for name, model in models:
    #shuffle=True is needed for random_state to take effect
    kfold = KFold(n_splits=10, shuffle=True, random_state=10)
    score = cross_val_score(model, X, y, cv=kfold, scoring='accuracy').mean()
    names.append(name)
    scores.append(score)
k_fold_cross_val_score = pd.DataFrame({'Name': names, 'Score': scores})
print(k_fold_cross_val_score)

  Name  Score
0   LR     0.
1  SVC     0.
2  KNN     0.
3   DT     0.
4  GNB     0.
5   RF     0.
6   GB     0.
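Because the classes are imbalanced, a stratified variant of K-fold, which keeps the class proportions similar in every fold, may be worth considering. This is a swapped-in alternative to the plain KFold used above, shown on synthetic data; the model list is cut down to two classifiers and all parameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the features and the Outcome column.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=10,
)

# Each of the 10 folds keeps roughly the same class ratio as the full data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)

candidates = {
    "LR": LogisticRegression(solver="liblinear"),
    "KNN": KNeighborsClassifier(),
}

# Mean accuracy over the 10 folds for each candidate model.
mean_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Size of the test part of every fold (each row is tested exactly once).
fold_sizes = [len(test_idx) for _, test_idx in cv.split(X, y)]

print(mean_scores)
print(fold_sizes)
```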

7/18/23, 12:14 PM
capstone-predict-diabetes - Jupyter Notebook
localhost:8888/notebooks/capstone-predict-diabetes.ipynb#