R Markdown
Question#1 Solution
> summary(HealthInsurance)
X health age limit gender
Min. : 1 Length:8802 Min. :18.00 Length:8802 Length:8802
1st Qu.:2201 Class :character 1st Qu.:30.00 Class :character Class :character
Median :4402 Mode :character Median :39.00 Mode :character Mode :character
Mean :4402 Mean :38.94
3rd Qu.:6602 3rd Qu.:48.00
Max. :8802 Max. :62.00
insurance married selfemp family region
Length:8802 Length:8802 Length:8802 Min. : 1.000 Length:8802
Class :character Class :character Class :character 1st Qu.: 2.000 Class :character
Mode :character Mode :character Mode :character Median : 3.000 Mode :character
Mean : 3.094
3rd Qu.: 4.000
Max. :14.000
ethnicity education
Length:8802 Length:8802
Class :character Class :character
Mode :character Mode :character
> #Age Attribute
> mean(HealthInsurance$age)
[1] 38.93683
> median(HealthInsurance$age)
[1] 39
>
> #Family Attribute
> mean(HealthInsurance$family)
[1] 3.093501
> median(HealthInsurance$family)
[1] 3
Question#2: Create a new data frame with a subset of the columns and rows. Make sure to rename it.
New_Df <- subset(HealthInsurance, select = c(health, age, limit, gender, insurance, married, selfemp, family, region, ethnicity, education))
New_Df
Question#2 Solution
health age limit gender insurance married selfemp family region ethnicity education
1 yes 31 no male yes yes yes 4 south cauc bachelor
2 yes 31 no female yes yes no 4 south cauc highschool
3 yes 54 no male yes yes no 5 west cauc ged
4 yes 27 no male yes no no 5 west cauc highschool
5 yes 39 no male yes yes no 5 west cauc none
6 yes 32 no female no no no 3 south afam bachelor
7 no 56 yes female yes yes no 2 west cauc highschool
8 yes 60 no female yes yes no 2 south cauc highschool
9 yes 62 no male yes yes no 2 south cauc highschool
10 yes 52 no female no yes no 2 northeast afam highschool
11 yes 50 no male yes yes no 2 northeast afam highschool
12 yes 44 no male yes no no 1 midwest cauc master
13 yes 26 no female no yes yes 3 south cauc phd
14 yes 38 yes male no yes no 7 west cauc none
15 yes 48 no male yes yes no 8 south cauc other
16 yes 48 no female yes yes no 8 south cauc none
17 yes 53 no male no yes yes 2 midwest cauc none
18 yes 50 no female no yes no 2 midwest cauc highschool
19 yes 23 no female no no no 1 midwest cauc highschool
20 yes 43 no female yes no no 2 south cauc highschool
21 yes 40 no female yes no no 4 midwest cauc highschool
22 yes 25 no male no no no 1 west cauc highschool
23 yes 27 no male no yes no 3 west cauc bachelor
24 yes 53 no male yes yes no 6 northeast afam none
25 yes 40 no female yes yes no 6 northeast afam none
26 no 22 yes female yes no no 2 northeast afam highschool
27 yes 51 yes male no yes no 3 west cauc none
28 no 27 no male no no no 3 west cauc none
29 no 58 yes female no yes no 3 west cauc none
30 yes 45 no male yes yes no 4 west cauc none
31 yes 18 no male yes no no 4 west cauc none
32 yes 31 no male yes yes no 2 south cauc other
33 yes 31 no female yes yes no 2 south cauc highschool
34 yes 27 no male no no no 1 south cauc bachelor
35 yes 61 no male yes no no 1 south cauc highschool
36 yes 33 yes female yes no no 1 west cauc bachelor
37 yes 40 no female yes no no 1 west cauc bachelor
38 no 43 yes male yes yes no 4 midwest cauc highschool
39 yes 44 no female yes yes no 4 midwest cauc bachelor
40 no 55 no male yes yes yes 2 south cauc highschool
41 yes 52 no female yes yes no 2 south cauc ged
42 yes 30 no female yes yes no 5 south cauc highschool
43 yes 33 no male yes yes no 5 south cauc highschool
44 yes 28 no male yes yes no 3 south cauc none
45 no 28 no female yes yes no 3 south cauc highschool
46 yes 31 no male yes yes no 2 west cauc highschool
47 yes 39 yes female no yes yes 2 west cauc other
48 yes 22 no male no no no 6 south cauc highschool
49 yes 21 no female no no no 6 south cauc highschool
50 yes 42 no female no yes yes 3 midwest afam highschool
51 no 61 yes female yes no no 3 northeast cauc highschool
52 yes 36 no male yes yes no 4 west cauc highschool
53 yes 27 no male yes yes no 3 west cauc bachelor
54 yes 46 no male yes yes no 4 midwest cauc highschool
55 yes 38 no female yes yes no 4 midwest cauc highschool
56 yes 29 no male yes yes no 5 west other none
57 yes 33 no male no no no 1 west cauc bachelor
58 yes 60 no male yes yes no 2 south afam other
59 yes 60 no female yes yes no 2 south afam other
60 yes 35 no male yes no no 1 west other highschool
61 yes 30 no female yes no no 3 midwest afam highschool
62 yes 61 yes male yes yes yes 2 midwest cauc highschool
63 yes 58 no female yes yes no 2 midwest cauc highschool
64 yes 46 no female yes no no 3 south afam highschool
65 yes 55 no female no no no 1 west cauc highschool
66 yes 20 yes female no no no 3 midwest cauc highschool
67 yes 44 no male yes yes no 3 midwest cauc highschool
68 yes 43 no female yes yes no 3 midwest cauc highschool
69 yes 26 yes male yes yes no 3 south cauc highschool
70 yes 45 no male yes yes yes 5 midwest cauc other
71 yes 39 no female yes yes yes 5 midwest cauc other
72 yes 21 no male yes no no 5 midwest cauc highschool
73 yes 53 no female yes no no 2 south cauc other
74 yes 18 no male yes no no 2 south cauc none
75 yes 31 no male yes yes no 2 south cauc highschool
76 yes 33 no female no yes no 2 south cauc highschool
77 yes 27 no female yes no no 2 south cauc other
78 yes 47 no male yes yes no 4 midwest cauc highschool
79 yes 55 no male yes yes no 2 midwest cauc master
80 yes 53 no female yes yes no 2 midwest cauc highschool
81 yes 29 no female yes yes no 2 midwest cauc master
82 yes 34 no male yes yes no 4 midwest cauc highschool
83 yes 34 no female yes yes no 4 midwest cauc other
84 no 38 yes female yes no no 1 west cauc highschool
85 yes 49 no male yes yes no 2 midwest cauc highschool
86 yes 32 no female yes yes no 2 midwest cauc other
87 yes 47 no male no yes no 10 south cauc none
88 yes 20 no male no yes no 10 south cauc none
89 yes 53 no female yes no no 1 midwest cauc highschool
90 yes 56 no male yes no yes 2 northeast cauc highschool
[ reached 'max' / getOption("max.print") -- omitted 8712 rows ]
Question#3: Create new column names for the new data frame.
colnames(New_Df) <- c('Healthy', 'AGE', 'Insurance_Limit', 'Gender' , 'Qualified_Insurance', 'Marital_Status', 'Self_Employed', 'Household_Size', 'Geo_Location', 'Ethnicity', 'Education_Level')
New_Df
Question#3 Solution (Partial Solution shown for the first 20 entries of each attribute)
Healthy AGE Insurance_Limit Gender Qualified_Insurance Marital_Status Self_Employed
1 yes 31 no male yes yes yes
2 yes 31 no female yes yes no
3 yes 54 no male yes yes no
4 yes 27 no male yes no no
5 yes 39 no male yes yes no
6 yes 32 no female no no no
7 no 56 yes female yes yes no
8 yes 60 no female yes yes no
9 yes 62 no male yes yes no
10 yes 52 no female no yes no
11 yes 50 no male yes yes no
12 yes 44 no male yes no no
13 yes 26 no female no yes yes
14 yes 38 yes male no yes no
15 yes 48 no male yes yes no
16 yes 48 no female yes yes no
17 yes 53 no male no yes yes
18 yes 50 no female no yes no
19 yes 23 no female no no no
20 yes 43 no female yes no no
Household_Size Geo_Location Ethnicity Education_Level
1 4 south cauc bachelor
2 4 south cauc highschool
3 5 west cauc ged
4 5 west cauc highschool
5 5 west cauc none
6 3 south afam bachelor
7 2 west cauc highschool
8 2 south cauc highschool
9 2 south cauc highschool
10 2 northeast afam highschool
11 2 northeast afam highschool
12 1 midwest cauc master
13 3 south cauc phd
14 7 west cauc none
15 8 south cauc other
16 8 south cauc none
17 2 midwest cauc none
18 2 midwest cauc highschool
19 1 midwest cauc highschool
20 2 south cauc highschool
Question#4 Solution
> summary(New_Df)
Healthy AGE Insurance_Limit Gender Qualified_Insurance
Length:8802 Min. :18.00 Length:8802 Length:8802 Length:8802
Class :character 1st Qu.:30.00 Class :character Class :character Class :character
Mode :character Median :39.00 Mode :character Mode :character Mode :character
Mean :38.94
3rd Qu.:48.00
Max. :62.00
Marital_Status Self_Employed Household_Size Geo_Location Ethnicity
Length:8802 Length:8802 Min. : 1.000 Length:8802 Length:8802
Class :character Class :character 1st Qu.: 2.000 Class :character Class :character
Mode :character Mode :character Median : 3.000 Mode :character Mode :character
Mean : 3.094
3rd Qu.: 4.000
Max. :14.000
Education_Level
Length:8802
Class :character
Mode :character
> #Age Attribute
> mean(New_Df$AGE)
[1] 38.93683
> median(New_Df$AGE)
[1] 39
> #Family Attribute
> mean(New_Df$Household_Size)
[1] 3.093501
> median(New_Df$Household_Size)
[1] 3
For the same attributes of Age and Family size in both the original data frame and new data frame we made, there is no difference in the mean and median values. We will get a mean of 38.93683 and median of 39 for Age and a mean of 3.093501 and median of 3 for Family. If we decided to modify our new data frame to show specific data, we can include a condition to do just that. For example, we can say we want to only consider individuals in the column Health to be "Yes" and exclude individuals marked as "No". In our existing data frame, we would need to include "health == yes". The query would look like:
New_Df <- subset(HealthInsurance, health == 'yes', select = c(age, limit, gender, insurance, married, selfemp, family, region, ethnicity, education))
New_Df
Question#5: For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e”in one column. Rename those values so that all 20 would show as “excellent”.
New_Df$Education_Level <- with(New_Df, replace(Education_Level, Education_Level == 'highschool', 'HighSchool_Diploma'))
New_Df
New_Df$Education_Level <- with(New_Df, replace(Education_Level, Education_Level == 'bachelor', 'Bachelors_Degree'))
New_Df
New_Df$Education_Level <- with(New_Df, replace(Education_Level, Education_Level == 'master', 'Masters_Degree'))
New_Df
Question#5 Solution (Partial Solution shown for the first 20 entries of each attribute)
Healthy AGE Insurance_Limit Gender Qualified_Insurance Marital_Status Self_Employed
1 yes 31 no male yes yes yes
2 yes 31 no female yes yes no
3 yes 54 no male yes yes no
4 yes 27 no male yes no no
5 yes 39 no male yes yes no
6 yes 32 no female no no no
7 no 56 yes female yes yes no
8 yes 60 no female yes yes no
9 yes 62 no male yes yes no
10 yes 52 no female no yes no
11 yes 50 no male yes yes no
12 yes 44 no male yes no no
13 yes 26 no female no yes yes
14 yes 38 yes male no yes no
15 yes 48 no male yes yes no
16 yes 48 no female yes yes no
17 yes 53 no male no yes yes
18 yes 50 no female no yes no
19 yes 23 no female no no no
20 yes 43 no female yes no no
Household_Size Geo_Location Ethnicity Education_Level
1 4 south cauc Bachelors_Degree
2 4 south cauc HighSchool_Diploma
3 5 west cauc ged
4 5 west cauc HighSchool_Diploma
5 5 west cauc none
6 3 south afam Bachelors_Degree
7 2 west cauc HighSchool_Diploma
8 2 south cauc HighSchool_Diploma
9 2 south cauc HighSchool_Diploma
10 2 northeast afam HighSchool_Diploma
11 2 northeast afam HighSchool_Diploma
12 1 midwest cauc Masters_Degree
13 3 south cauc phd
14 7 west cauc none
15 8 south cauc other
16 8 south cauc none
17 2 midwest cauc none
18 2 midwest cauc HighSchool_Diploma
19 1 midwest cauc HighSchool_Diploma
20 2 south cauc HighSchool_Diploma
Question#6: Display enough rows to see examples of all of steps 1-5 above.
#print(New_Df) this will show the data set up to the max threshold
head(New_Df, 20)
tail(New_Df, 20)
Question#6 Solution
> head(New_Df, 20)
Healthy AGE Insurance_Limit Gender Qualified_Insurance Marital_Status Self_Employed
1 yes 31 no male yes yes yes
2 yes 31 no female yes yes no
3 yes 54 no male yes yes no
4 yes 27 no male yes no no
5 yes 39 no male yes yes no
6 yes 32 no female no no no
7 no 56 yes female yes yes no
8 yes 60 no female yes yes no
9 yes 62 no male yes yes no
10 yes 52 no female no yes no
11 yes 50 no male yes yes no
12 yes 44 no male yes no no
13 yes 26 no female no yes yes
14 yes 38 yes male no yes no
15 yes 48 no male yes yes no
16 yes 48 no female yes yes no
17 yes 53 no male no yes yes
18 yes 50 no female no yes no
19 yes 23 no female no no no
20 yes 43 no female yes no no
Household_Size Geo_Location Ethnicity Education_Level
1 4 south cauc Bachelors_Degree
2 4 south cauc HighSchool_Diploma
3 5 west cauc ged
4 5 west cauc HighSchool_Diploma
5 5 west cauc none
6 3 south afam Bachelors_Degree
7 2 west cauc HighSchool_Diploma
8 2 south cauc HighSchool_Diploma
9 2 south cauc HighSchool_Diploma
10 2 northeast afam HighSchool_Diploma
11 2 northeast afam HighSchool_Diploma
12 1 midwest cauc Masters_Degree
13 3 south cauc phd
14 7 west cauc none
15 8 south cauc other
16 8 south cauc none
17 2 midwest cauc none
18 2 midwest cauc HighSchool_Diploma
19 1 midwest cauc HighSchool_Diploma
20 2 south cauc HighSchool_Diploma
> tail(New_Df, 20)
Healthy AGE Insurance_Limit Gender Qualified_Insurance Marital_Status Self_Employed
8783 yes 57 yes male yes yes no
8784 yes 33 no male yes yes no
8785 yes 32 no female yes yes yes
8786 yes 45 no male yes yes no
8787 yes 42 no female yes yes no
8788 yes 19 no male yes no no
8789 yes 47 no female no yes yes
8790 yes 22 no male no no no
8791 yes 40 no female yes no no
8792 no 33 no male yes yes no
8793 yes 50 no male yes no yes
8794 no 36 no male yes no no
8795 yes 34 no female yes no no
8796 yes 34 no male yes yes no
8797 yes 33 no female yes yes no
8798 yes 46 no female yes yes no
8799 yes 50 no male yes yes no
8800 yes 27 no male yes yes no
8801 yes 27 no female yes yes no
8802 yes 35 no male yes yes no
Household_Size Geo_Location Ethnicity Education_Level
8783 2 northeast cauc none
8784 2 midwest cauc Bachelors_Degree
8785 2 midwest cauc Masters_Degree
8786 5 northeast cauc other
8787 5 northeast cauc other
8788 5 northeast cauc HighSchool_Diploma
8789 4 south cauc Bachelors_Degree
8790 4 south cauc HighSchool_Diploma
8791 3 midwest afam HighSchool_Diploma
8792 6 south cauc HighSchool_Diploma
8793 1 midwest cauc HighSchool_Diploma
8794 1 northeast cauc HighSchool_Diploma
8795 2 south cauc HighSchool_Diploma
8796 4 midwest cauc HighSchool_Diploma
8797 4 midwest cauc HighSchool_Diploma
8798 3 northeast cauc HighSchool_Diploma
8799 3 northeast cauc HighSchool_Diploma
8800 2 south cauc Bachelors_Degree
8801 2 south cauc Bachelors_Degree
8802 4 northeast cauc phd
Question#7: BONUS –place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
HealthInsurance_DataSet <- read.csv('https://raw.githubusercontent.com/jakegeo/RBridge-Course-Work/main/HealthInsurance.csv')
HealthInsurance_DataSet
Question#7 Solution (Partial Solution shown for the first 20 entries of each attribute)
> HealthInsurance_DataSet <- read.csv('https://raw.githubusercontent.com/jakegeo/RBridge-Course-Work/main/HealthInsurance.csv')
> HealthInsurance_DataSet
X health age limit gender insurance married selfemp family region ethnicity education
1 1 yes 31 no male yes yes yes 4 south cauc bachelor
2 2 yes 31 no female yes yes no 4 south cauc highschool
3 3 yes 54 no male yes yes no 5 west cauc ged
4 4 yes 27 no male yes no no 5 west cauc highschool
5 5 yes 39 no male yes yes no 5 west cauc none
6 6 yes 32 no female no no no 3 south afam bachelor
7 7 no 56 yes female yes yes no 2 west cauc highschool
8 8 yes 60 no female yes yes no 2 south cauc highschool
9 9 yes 62 no male yes yes no 2 south cauc highschool
10 10 yes 52 no female no yes no 2 northeast afam highschool
11 11 yes 50 no male yes yes no 2 northeast afam highschool
12 12 yes 44 no male yes no no 1 midwest cauc master
13 13 yes 26 no female no yes yes 3 south cauc phd
14 14 yes 38 yes male no yes no 7 west cauc none
15 15 yes 48 no male yes yes no 8 south cauc other
16 16 yes 48 no female yes yes no 8 south cauc none
17 17 yes 53 no male no yes yes 2 midwest cauc none
18 18 yes 50 no female no yes no 2 midwest cauc highschool
19 19 yes 23 no female no no no 1 midwest cauc highschool
20 20 yes 43 no female yes no no 2 south cauc highschool