Question 1: Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes of your data.
#Load package
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the data
Healthinsurancedata <- read_csv("WEEK2HOMEWORK/HealthInsurance.csv")
## New names:
## • `` -> `...1`
## Rows: 8802 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): health, limit, gender, insurance, married, selfemp, region, ethnici...
## dbl (3): ...1, age, family
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Use summary function to overview data
summary(Healthinsurancedata)
##       ...1         health               age           limit
##  Min.   :   1   Length:8802        Min.   :18.00   Length:8802
##  1st Qu.:2201   Class :character   1st Qu.:30.00   Class :character
##  Median :4402   Mode  :character   Median :39.00   Mode  :character
##  Mean   :4402                      Mean   :38.94
##  3rd Qu.:6602                      3rd Qu.:48.00
##  Max.   :8802                      Max.   :62.00
##     gender           insurance           married            selfemp
##  Length:8802        Length:8802        Length:8802        Length:8802
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##      family          region           ethnicity          education
##  Min.   : 1.000   Length:8802        Length:8802        Length:8802
##  1st Qu.: 2.000   Class :character   Class :character   Class :character
##  Median : 3.000   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 3.094
##  3rd Qu.: 4.000
##  Max.   :14.000
#Displaying mean and median
mean_age <- mean(Healthinsurancedata$age)
median_age <- median(Healthinsurancedata$age)
mean_family <- mean(Healthinsurancedata$family)
median_family <- median(Healthinsurancedata$family)
#Printing the result
cat("Mean of Age is:", mean_age, "\n")
## Mean of Age is: 38.93683
cat("Median of Age is:", median_age, "\n")
## Median of Age is: 39
cat("Mean of family is:", mean_family, "\n")
## Mean of family is: 3.093501
cat("Median of family is:", median_family, "\n")
## Median of family is: 3
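The four separate calls above can also be collapsed into one step. A minimal sketch in base R, assuming a small made-up data frame `toy` (hypothetical values, not the HealthInsurance data) with the same two numeric columns:

```r
# Hypothetical stand-in for the two numeric columns used above
toy <- data.frame(
  age    = c(23, 31, 46, 52, 62),
  family = c(1, 2, 2, 5, 8)
)

# Apply mean() and median() to every column; the result is a matrix
# with one row per statistic and one column per attribute
stats <- sapply(toy, function(x) c(mean = mean(x), median = median(x)))
print(stats)
```

Individual values can then be read by name, for example `stats["median", "age"]`.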
Question 2: Create a new data frame with a subset of the columns AND rows. There are several ways to do this, so feel free to try a couple if you want. Make sure to give the new data set its own name so that it does not simply overwrite the original.
# Keep five columns and the first 20 rows, stored under a new name
new_df <- Healthinsurancedata %>%
  select(health, age, gender, married, family) %>%
  slice(1:20)
#Print result
new_df
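Since the question notes that there are several ways to subset, here is a sketch of two base R alternatives. The data frame `toy` and its values are made up for illustration and are not the HealthInsurance data.

```r
# Hypothetical stand-in data (made-up values)
toy <- data.frame(
  health = c("yes", "yes", "no", "yes", "no", "yes"),
  age    = c(23, 31, 46, 52, 62, 38),
  gender = c("male", "female", "female", "male", "male", "female"),
  region = c("south", "west", "south", "midwest", "west", "south"),
  family = c(1, 2, 2, 5, 8, 3)
)

# Option 1: positional subsetting, rows first, then columns by name
by_position <- toy[1:4, c("health", "age", "family")]

# Option 2: keep rows that satisfy a condition and pick columns by name
by_condition <- subset(toy, age >= 40, select = c(health, age, family))

print(by_position)
print(by_condition)
```

Both results are stored under new names, so `toy` itself is left unchanged.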
Question 3: Create new column names for each column in the new data frame created in step 2.
new_colnames <- c("Health_new","age_new","gender_new","married_new","family_new")
colnames(new_df) <- new_colnames
#Print result
new_df
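When every new name follows the same pattern, the names can be generated instead of typed out. A minimal sketch in base R on a made-up data frame `toy` (hypothetical, not the HealthInsurance data):

```r
# Hypothetical stand-in data (made-up values)
toy <- data.frame(
  health = c("yes", "no"),
  age    = c(23, 46),
  family = c(1, 2)
)

# Work on a copy so the original column names stay available in toy
renamed <- toy

# Append the suffix "_new" to every existing column name
names(renamed) <- paste0(names(toy), "_new")

# Rename one column by matching its current name
names(renamed)[names(renamed) == "health_new"] <- "Health_new"

print(names(renamed))
```

Matching on the current name rather than on a position keeps the rename correct even if the column order changes later.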
Question 4: Use the summary function to create an overview of your new data frame created in step 2. Then print the mean and median for the same two attributes. Please compare (i.e. tell me how the values changed and why).
summary(new_df)
##   Health_new           age_new       gender_new        married_new
##  Length:20          Min.   :23.00   Length:20          Length:20
##  Class :character   1st Qu.:31.75   Class :character   Class :character
##  Mode  :character   Median :46.00   Mode  :character   Mode  :character
##                     Mean   :43.35
##                     3rd Qu.:52.25
##                     Max.   :62.00
##    family_new
##  Min.   :1.0
##  1st Qu.:2.0
##  Median :2.5
##  Mean   :3.5
##  3rd Qu.:5.0
##  Max.   :8.0
#Displaying mean and median of new df
mean(new_df$age_new)
## [1] 43.35
median(new_df$age_new)
## [1] 46
mean(new_df$family_new)
## [1] 3.5
median(new_df$family_new)
## [1] 2.5
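The values changed because the new data frame holds only the first 20 of the original 8,802 rows, so its statistics describe that small, non-random slice rather than the whole data set. Mean age rose from 38.94 to 43.35 and median age from 39 to 46. Mean family size rose from 3.09 to 3.5, while median family size fell from 3 to 2.5. With so few rows, a few older respondents or larger families are enough to shift the results.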
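To get a feel for how far a 20-row subset can drift from the full-data mean, the effect can be simulated. This is only a sketch: the ages below are randomly generated (hypothetical), not taken from the HealthInsurance data, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical population: 8802 simulated ages between 18 and 62
pop_age <- sample(18:62, size = 8802, replace = TRUE)

# Mean of the full column versus the mean of the first 20 values
full_mean  <- mean(pop_age)
first_mean <- mean(head(pop_age, 20))

# Means of 1000 different random 20-row subsets
subset_means <- replicate(1000, mean(sample(pop_age, 20)))

cat("Full mean:", full_mean, "\n")
cat("First-20 mean:", first_mean, "\n")
cat("Range of 20-row means:", range(subset_means), "\n")
```

The spread of `subset_means` around `full_mean` shows how much a statistic can move purely because only 20 rows were kept.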
Question 5: For at least 3 different/distinct values in a column please rename so that every value in that column is renamed. For example, change the letter “e” to “excellent”, the letter “a” to “average” and the word “bad” to “terrible”.
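new_df <- new_df %>%
  mutate(family_new = case_when(
    family_new == 3 ~ "family of 3",
    family_new == 4 ~ "family of 4",
    family_new == 5 ~ "family of 5",
    # Fallback so that sizes other than 3, 4, and 5 are relabeled instead of becoming NA
    TRUE ~ paste("family of", family_new)
  ))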
#Printing
new_df
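When recoding, any value that no rule matches needs a fallback, otherwise it is easy to end up with gaps in the column. A minimal sketch of a lookup-vector approach in base R, where the vector `family_size` and the labels are made up for illustration:

```r
# Hypothetical family sizes (made-up values)
family_size <- c(1, 2, 3, 4, 5, 8)

# Lookup vector: names are the old values, elements are the new labels
labels <- c("1" = "single", "2" = "couple", "3" = "family of 3")

# Look up each value by its character form; unmatched values get a generic label
key <- as.character(family_size)
recoded <- ifelse(key %in% names(labels), labels[key], paste("family of", key))

print(recoded)

# Confirm that no value was left unmapped
stopifnot(!anyNA(recoded))
```

Keeping the mapping in one named vector makes it easy to add or change labels without touching the recoding logic.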
Question 6: Display enough rows to see examples of all of steps 1-5 above. This means use a function to show me enough row values that I can see the changes.
head(new_df, 6)
Question 7: Bonus Question
Healthinsurancedata_github <- read_csv("https://raw.githubusercontent.com/Ismoo225/RBRIDGE/main/HealthInsurance.csv")
## New names:
## • `` -> `...1`
## Rows: 8802 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): health, limit, gender, insurance, married, selfemp, region, ethnici...
## dbl (3): ...1, age, family
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Healthinsurancedata_github, 6)