Note: Import the iris dataset keeping the text intact.
Solution
import numpy as np
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')
# Print the first 3 rows
iris[:3]
#> array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
#> [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
#> [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa']], dtype=object)
Note: Extract the text column species from the 1D iris imported in the previous question.
Input:
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
Solution
# Input:
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
print(iris_1d.shape)
# Solution:
species = np.array([row[4] for row in iris_1d])
species[:5]
#> (150,)
#> array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
#> b'Iris-setosa'],
#> dtype='|S18')
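With dtype=None, genfromtxt returns a 1D structured array, and when no names are passed the fields get the default names f0, f1, and so on. The species column can therefore also be read by field name instead of a Python loop. The following is a minimal sketch on a hand-built structured array; the record dtype (including the 'U18' string width) and the three sample rows are assumptions made for illustration, not the result of downloading the file.

```python
import numpy as np

# Hand-built stand-in for iris_1d: one record per flower, default field names.
record_dtype = [('f0', 'f8'), ('f1', 'f8'), ('f2', 'f8'), ('f3', 'f8'), ('f4', 'U18')]
iris_1d_sample = np.array([
    (5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
    (7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'),
    (6.3, 3.3, 6.0, 2.5, 'Iris-virginica'),
], dtype=record_dtype)

# Field access returns the whole column at once, without a list comprehension.
species_sample = iris_1d_sample['f4']
print(iris_1d_sample.shape)
print(species_sample)
```

On the real array the equivalent call would be iris_1d['f4'], assuming the file was read without a names argument.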
Note: Convert the 1D iris to 2D array iris_2d by omitting the species text field.
Input:
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
Solution
# Input:
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
# Output:
# Method 1: Convert each row to a list and get the first 4 items
iris_2d = np.array([row.tolist()[:4] for row in iris_1d])
iris_2d[:4]
# Alt Method 2: Import only the first 4 columns from source url
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[:4]
#> array([[ 5.1, 3.5, 1.4, 0.2],
#> [ 4.9, 3. , 1.4, 0.2],
#> [ 4.7, 3.2, 1.3, 0.2],
#> [ 4.6, 3.1, 1.5, 0.2]])
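To try Method 2 without network access, the same call can be pointed at an in-memory sample. This is a minimal sketch: the three rows in sample are copied from the output shown earlier, and io.StringIO stands in for the URL; the name iris_2d_sample is only illustrative.

```python
import io
import numpy as np

# Three rows copied from the iris file; StringIO stands in for the URL.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# usecols skips the text column, so every remaining field parses as float.
iris_2d_sample = np.genfromtxt(io.StringIO(sample), delimiter=',',
                               dtype='float', usecols=[0, 1, 2, 3])
print(iris_2d_sample.shape)
print(iris_2d_sample.dtype)
```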
Note: Find the mean, median, and standard deviation of iris's sepallength (1st column).
Input
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
Solution
# Input
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])
# Output
mu, med, sd = np.mean(sepallength), np.median(sepallength), np.std(sepallength)
print(mu, med, sd)
#> 5.84333333333 5.8 0.825301291785
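Note that np.std divides by n by default (ddof=0), which gives the population standard deviation. If the sample standard deviation is wanted instead, ddof=1 can be passed. A small sketch, using only the first five sepal lengths of the file as a stand-in for the full column:

```python
import numpy as np

# First five sepal lengths from the iris file, used as a small stand-in.
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

pop_sd = np.std(x)             # divides by n (ddof=0, the default)
sample_sd = np.std(x, ddof=1)  # divides by n - 1

print(np.mean(x), np.median(x), pop_sd, sample_sd)
```

The sample value is always slightly larger than the population value for the same data, and the gap shrinks as n grows.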
Note: Create a normalized form of iris's sepallength whose values range exactly between 0 and 1 so that the minimum has value 0 and maximum has value 1.
Input
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])
Solution
# Input
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])
# Output
Smax, Smin = sepallength.max(), sepallength.min()
S = (sepallength - Smin)/(Smax - Smin)
print(S)
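The same min-max formula can be wrapped in a small helper. This is only a sketch: min_max_scale is a hypothetical name, and returning zeros for a constant input is one possible design choice to avoid dividing by zero, not something the exercise prescribes.

```python
import numpy as np

def min_max_scale(x):
    """Rescale x linearly so its minimum maps to 0 and its maximum to 1."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant input: return zeros instead of dividing by zero.
        return np.zeros_like(x)
    return (x - x.min()) / span

scaled = min_max_scale([4.3, 5.8, 7.9])
print(scaled)
```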
Note: Find the 5th and 95th percentile of iris's sepallength.
Input:
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])
Solution
# Input
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])
# Output
np.percentile(sepallength, q=[5, 95])
#> array([ 4.6 , 7.255])
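By default np.percentile interpolates linearly between neighbouring sorted values, which is why the 95th percentile above (7.255) is not a value that appears in the data. A small sketch on the numbers 1 to 10 makes the rule visible: the q-th percentile sits at position q/100 * (n - 1) in the sorted array.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)  # 1.0, 2.0, ..., 10.0

# Position for q=5 is 0.05 * 9 = 0.45, i.e. 45% of the way from 1.0 to 2.0.
p5, p95 = np.percentile(x, q=[5, 95])
# np.quantile takes fractions in [0, 1] instead of percentages.
q5, q95 = np.quantile(x, [0.05, 0.95])
print(p5, p95)  # 1.45 9.55 (up to floating-point rounding)
```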
Note: Insert np.nan values at 20 random positions in the iris_2d dataset.
# Input
url = 'https://pythontraining.dzone.co.in/tutorial/exercises/numpy/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')
Solution
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')
# Method 1
i, j = np.where(iris_2d)
# i, j contain the row and column indices of every truthy element of iris_2d
np.random.seed(100)
iris_2d[np.random.choice(i, 20), np.random.choice(j, 20)] = np.nan
# Method 2
np.random.seed(100)
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
# Print first 10 rows
print(iris_2d[:10])
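Both methods draw row and column indices independently and with replacement, so the same cell can be picked more than once and fewer than 20 values may end up as nan. If exactly 20 distinct positions are needed, one option is to sample flat indices without replacement and convert them back to (row, column) pairs. The sketch below assumes a synthetic 150 x 4 float array in place of the downloaded data and uses the Generator API (np.random.default_rng) rather than np.random.seed.

```python
import numpy as np

# Synthetic 150 x 4 float array standing in for the numeric iris columns.
data = np.arange(600, dtype=float).reshape(150, 4)

rng = np.random.default_rng(100)
# Sample 20 distinct flat positions, then convert them to (row, col) pairs.
flat_idx = rng.choice(data.size, size=20, replace=False)
rows, cols = np.unravel_index(flat_idx, data.shape)
data[rows, cols] = np.nan

print(np.isnan(data).sum())
```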
Note: Find the number and position of missing values in iris_2d's sepallength (1st column).
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
Solution
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
# Solution
print("Number of missing values: \n", np.isnan(iris_2d[:, 0]).sum())
print("Position of missing values: \n", np.where(np.isnan(iris_2d[:, 0])))
#> Number of missing values:
#> 5
#> Position of missing values:
#> (array([ 39, 88, 99, 130, 147]),)
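np.where with a single argument returns a tuple of index arrays, which is why the output above is wrapped in parentheses. When a plain 1D index array is more convenient, np.flatnonzero can be used instead. A minimal sketch on a hand-made column (the values and nan positions are illustrative):

```python
import numpy as np

col = np.array([5.1, np.nan, 4.7, np.nan, 5.0])

missing = np.isnan(col)              # boolean mask, True where the value is nan
n_missing = int(missing.sum())       # each True counts as 1
positions = np.flatnonzero(missing)  # indices as a plain 1D array
print(n_missing, positions)
```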
Note: Filter the rows of iris_2d that have petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0.
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
Solution
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
# Output
condition = (iris_2d[:, 2] > 1.5) & (iris_2d[:, 0] < 5.0)
iris_2d[condition]
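The filter can be checked by hand on a few rows. In this sketch the four rows are illustrative values chosen so that exactly one row satisfies both conditions; the names rows, mask and filtered are not part of the exercise.

```python
import numpy as np

rows = np.array([
    [4.8, 3.4, 1.6, 0.2],   # petallength > 1.5 and sepallength < 5.0: kept
    [5.1, 3.5, 1.4, 0.2],   # fails both conditions
    [4.9, 3.0, 1.4, 0.2],   # petallength too small
    [5.4, 3.9, 1.7, 0.4],   # sepallength too large
])

# Parentheses are required because & binds tighter than the comparisons.
mask = (rows[:, 2] > 1.5) & (rows[:, 0] < 5.0)
filtered = rows[mask]
print(filtered)
```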
Note: Select the rows of iris_2d that do not have any nan value.
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
Solution
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan
# Output
# No direct numpy function for this.
# Method 1:
# True for rows that contain no nan
no_nan_in_row = np.array([~np.any(np.isnan(row)) for row in iris_2d])
iris_2d[no_nan_in_row][:5]
# Method 2: (By Rong)
iris_2d[np.sum(np.isnan(iris_2d), axis = 1) == 0][:5]
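A third variant replaces the list comprehension of Method 1 with a reduction along axis=1, which keeps the whole computation inside NumPy. A minimal sketch on a hand-made 4 x 4 array (the values and nan positions are illustrative):

```python
import numpy as np

data = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, np.nan, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [np.nan, 3.1, 1.5, np.nan],
])

# any(axis=1) flags rows with at least one nan; ~ inverts the flags.
complete = ~np.isnan(data).any(axis=1)
clean = data[complete]
print(clean)
```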